Harvard Business School Online | Business Insights

8 Steps in the Data Life Cycle


Whether you manage data initiatives, work with data professionals, or are employed by an organization that regularly conducts data projects, a firm understanding of what the average data project looks like can prove highly beneficial to your career. This knowledge—paired with other data skills—is what many organizations look for when hiring.

No two data projects are identical; each brings its own challenges, opportunities, and potential solutions that impact its trajectory. Nearly all data projects, however, follow the same basic life cycle from start to finish. This life cycle can be split into eight common stages, steps, or phases:

1. Generation
2. Collection
3. Processing
4. Storage
5. Management
6. Analysis
7. Visualization
8. Interpretation

Below is a walkthrough of the processes typically involved in each of them.


Data Life Cycle Stages

The data life cycle is often described as a cycle because the lessons learned and insights gleaned from one data project typically inform the next. In this way, the final step of the process feeds back into the first.


1. Generation

For the data life cycle to begin, data must first be generated. Otherwise, the following steps can’t be initiated.

Data generation occurs regardless of whether you’re aware of it, especially in our increasingly online world. Some of this data is generated by your organization, some by your customers, and some by third parties you may or may not be aware of. Every sale, purchase, hire, communication, interaction—everything generates data. Given the proper attention, this data can often lead to powerful insights that allow you to better serve your customers and become more effective in your role.


2. Collection

Not all of the data that’s generated every day is collected or used. It’s up to your data team to identify what information should be captured and the best means for doing so, and what data is unnecessary or irrelevant to the project at hand.

You can collect data in a variety of ways, including:

It’s important to note that many organizations take a broad approach to data collection, capturing as much data as possible from each interaction and storing it for potential use. While drawing from this supply is certainly an option, it’s always important to start by creating a plan to capture the data you know is critical to your project.

3. Processing

Once data has been collected, it must be processed. Data processing can refer to various activities, including:

Even the simple act of taking a printed form and digitizing it can be considered a form of data processing.
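To make this concrete, here is a minimal, illustrative sketch of basic processing in Python with pandas; the file name `responses.csv` and its columns are assumptions, not part of the original article:

```python
import pandas as pd

# Hypothetical file of digitized survey forms; column names are assumptions.
df = pd.read_csv("responses.csv")

# Basic processing: standardize types, trim whitespace, and drop duplicates.
df["submitted_at"] = pd.to_datetime(df["submitted_at"], errors="coerce")
df["customer_name"] = df["customer_name"].str.strip().str.title()
df = df.drop_duplicates(subset=["response_id"])

print(df.dtypes)
print(f"{len(df)} cleaned records ready for storage")
```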

4. Storage

After data has been collected and processed, it must be stored for future use. This is most commonly achieved through the creation of databases or datasets. These datasets may then be stored in the cloud, on servers, or using another form of physical storage like a hard drive, CD, cassette, or floppy disk.

When determining how to best store data for your organization, it’s important to build in a certain level of redundancy to ensure that a copy of your data will be protected and accessible, even if the original source becomes corrupted or compromised.

5. Management

Data management, also called database management, involves organizing, storing, and retrieving data as necessary over the life of a data project. While referred to here as a “step,” it’s an ongoing process that takes place from the beginning through the end of a project. Data management includes everything from storage and encryption to implementing access logs and changelogs that track who has accessed data and what changes they may have made.

6. Analysis

Data analysis refers to processes that attempt to glean meaningful insights from raw data. Analysts and data scientists use different tools and strategies to conduct these analyses. Some of the more commonly used methods include statistical modeling, algorithms, artificial intelligence, data mining, and machine learning.

Exactly who performs an analysis depends on the specific challenge being addressed, as well as the size of your organization’s data team. Business analysts, data analysts, and data scientists can all play a role.

7. Visualization

Data visualization refers to the process of creating graphical representations of your information, typically through the use of one or more visualization tools. Visualizing data makes it easier to quickly communicate your analysis to a wider audience both inside and outside your organization. The form your visualization takes depends on the data you’re working with, as well as the story you want to communicate.

While technically not a required step for all data projects, data visualization has become an increasingly important part of the data life cycle.
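As an illustrative sketch only (the figures below are made up, and matplotlib is just one of many visualization tools), a simple chart at this stage might be produced like this:

```python
import matplotlib.pyplot as plt

# Made-up example data; in practice these values come from the analysis stage.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
sales = [120, 135, 128, 150, 162, 171]

fig, ax = plt.subplots(figsize=(6, 3))
ax.bar(months, sales, color="steelblue")
ax.set_title("Monthly sales (example data)")
ax.set_ylabel("Units sold")
fig.tight_layout()
fig.savefig("monthly_sales.png")  # share the image with a wider audience
```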

8. Interpretation

Finally, the interpretation phase of the data life cycle provides the opportunity to make sense of your analysis and visualization. Beyond simply presenting the data, this is when you investigate it through the lens of your expertise and understanding. Your interpretation may not only include a description or explanation of what the data shows but, more importantly, what the implications may be.

Other Frameworks

The eight steps outlined above offer an effective framework for thinking about a data project’s life cycle. That being said, it isn’t the only way to think about data. Another commonly cited framework breaks the data life cycle into the following phases:

While this framework's phases use slightly different terms, they largely align with the steps outlined in this article.


The Importance of Understanding the Data Life Cycle

Even if you don’t directly work with your organization’s data team or projects, understanding the data life cycle can empower you to communicate more effectively with those who do. It can also provide insights that allow you to conceive of potential projects or initiatives.

The good news is that, unless you intend to transition into or start a career as a data analyst or data scientist, it’s highly unlikely you’ll need a degree in the field. Several faster and more affordable options for learning basic data skills exist, such as online courses.

Are you interested in improving your data science and analytical skills? Learn more about our online course Business Analytics, or download the Beginner’s Guide to Data & Analytics to learn how you can leverage the power of data for professional and organizational success.


The Data Science Project Life Cycle Explained

By Olha Zhydik | Data Science Central

As Covid-19 continues to shape the global economy, analytics and business intelligence (BI) projects can help organisations prepare and implement strategies to navigate the crisis. According to the Covid-19 Impact Survey by Dresner Advisory Services, most respondents believe that data-driven decision-making is crucial to survive and thrive during the pandemic and beyond. This article provides a step-by-step overview of the typical data science project life cycle, including some best practices and expert advice.

Results of a survey by O'Reilly show that enterprises are stabilising their adoption patterns for artificial intelligence (AI) across a wide variety of functional areas.

Source: AI adoption in the enterprise 2020

The same survey shows that 53% of enterprises using AI today recognise unexpected outcomes and predictions as the greatest risk when building and deploying machine learning (ML) models.

As an executive driving and overseeing data science adoption in your organisation, what can you do to achieve a reliable outcome from your data modelling project while getting the best ROI and mitigating security risks at the same time?

The answer lies in thorough project planning and expert execution at every stage of the data science project life cycle. Whether you use your in-house resources or outsource your project to an external team of data scientists, you should:

Here's our rundown of a data science project life cycle, including the six main steps of the cross-industry standard process for data mining (CRISP-DM) and additional steps from data science solutions that are essential parts of every data science project. This roadmap is based on decades of experience in delivering data modelling and analysis solutions for a range of business domains, including e-commerce, retail, fashion and finance. It will help you avoid critical mistakes from the start and ensure smooth rollout and model deployment down the line.


A typical data science project life cycle step by step

1. Ideation and initial planning

Without a valid idea and a comprehensive plan in place, it is difficult to align your model with your business needs and project goals, or to judge its strengths, scope and the challenges involved. First, you need to understand what business problems and requirements you have and how they can be addressed with a data science solution.

At this stage, we often recommend that businesses run a feasibility study: exhaustive research that allows you to define your goals for a solution and then build the team best equipped to deliver it. There are usually several other software development life cycle (SDLC) steps that will run in parallel with data modelling, including solution design, software development, testing, DevOps activities and more. The planning stage ensures you have all the required roles and skills on your team to make the project run smoothly through all of its stages, meet its purpose and achieve its desired progress within the given time limit.

2. Side SDLC activities: design, software development and testing

As you kick off your data analysis and modelling project, several other activities usually run in parallel as parts of the SDLC. These include product design, software development, quality assurance activities and more. Here, team collaboration and alignment are key to project success.

For your model to be deployed as a ready-to-use solution, you need to make sure that your team is aligned through all the software development stages. It's essential for your data scientists to work closely with other development team members, especially with product designers and DevOps, to ensure your solution has an easy-to-use interface and that all of the features and functionality your data model provides are integrated there in the way that's most convenient to the user. Your DevOps engineers will also play an important role in deciding how the model will be integrated within your real production environment, as it can be deployed as a microservice, which facilitates scaling, versioning and security.

When the product is subject to quality assurance activities, the model gets tested within the team€™s staging environment and by the customer.

3. Business understanding: Identifying your problems and business needs, strategy and roadmap creation

The importance of understanding your business needs, and the availability and nature of your data, can't be overstated. Every data science project should be 'business first', hence the need to define business problems and objectives from the outset.

In the initial phase of a data science project, companies should also set the key performance indicators and criteria that will be indicative of project success. After defining your business objectives, you should assess the data you have at your disposal, what industry/market data is available and how usable it is.

The most important task within the business understanding stage is to define whether the problem can be solved by the available or state-of-the-art modelling and analysis approaches. The second most important task is to understand the domain, which allows data scientists to define new model features, initiate model transformations and come up with improvement recommendations.

4. Data understanding: data acquisition and exploratory data analysis

The preceding stages were intended to help you define your criteria for data science project success. Having those available, your data science team will be able to prepare your data for analysis and recommend which data to use and how.

The better the data you use, the better your model is. So, an initial analysis of data should provide some guiding insights that will help set the tone for modelling and further analysis. Based on your business needs, your data scientists should understand how much data you need to build and train the model.

How can you tell good data from bad data? Data quality is imperative, but how are you to know if your information really isn't up to the required standard? Here are some of the 'red flags' to watch out for:

Types of data that can be analysed include financial statements, customer and market demand data, supply chain and manufacturing data, text corpora, video and audio, image datasets, as well as time series, logs and signals.

Some types of data are a lot more costly and time-consuming to collect and label properly than others; the process can take even longer than the modelling itself. So, you need to understand how much cost is involved, how much effort is needed and what outcome you can expect, as well as your potential ROI before you make a hefty investment in the project.

5. Data preparation and preprocessing

Once you've established your goals, gained a clear understanding of the data needed and acquired the data, you can move on to data preprocessing. The best method for this depends on the nature of the data you have: there are, for example, different time and cost ramifications for text and image data.

It's a pivotal stage, and your data scientists need to tread carefully when they're assessing data quality. If there are missing data values and your data scientists use a statistical approach to fill in the gaps, it could ultimately compromise the quality of your modelling results. Your data scientists should be able to evaluate data completeness and accuracy, spot noisy data and ask the right questions to fill any gaps, but it's essential to engage domain experts for consultancy.

Data acquisition is usually done through an Extract, Transform and Load (ETL) pipeline.


The ETL (Extract, Transform and Load) pipeline

ETL is a process of data integration that includes three steps that combine information from various sources. The ETL approach is usually applied to create a data warehouse. The information is extracted from a source, transformed into a specific format for further analysis and loaded into a data warehouse.
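As a rough sketch of that flow, assuming a hypothetical `orders.csv` source and a local SQLite file standing in for the data warehouse, an ETL step in Python might look like this:

```python
import sqlite3
import pandas as pd

# Extract: read raw data from a hypothetical source file.
raw = pd.read_csv("orders.csv")

# Transform: normalise formats and derive the fields the analysis needs.
raw["order_date"] = pd.to_datetime(raw["order_date"], errors="coerce")
raw["total"] = raw["quantity"] * raw["unit_price"]
clean = raw.dropna(subset=["order_date", "customer_id"])

# Load: write the transformed data into a warehouse table (SQLite as a stand-in).
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("orders_clean", conn, if_exists="replace", index=False)
```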

The main purpose of data preprocessing is to transform information from images, audio, logs and other sources into numerical, normalised and scaled values. Another aim of data preparation is to cleanse the information. It's possible that your data is usable but serves no outlined purpose; in such a case, 70%-80% of total modelling time may be assigned to data cleansing or to replacing data samples that are missing or contradictory.

In many situations, you may need additional feature extraction from your data (for example, calculating a room's area from its width and length for rent price estimation), as in the sketch below.
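A minimal sketch of that kind of feature extraction with pandas, using made-up listing data:

```python
import pandas as pd

# Hypothetical listings data; width and length are in metres.
listings = pd.DataFrame({
    "room_width": [3.0, 4.5, 2.8],
    "room_length": [4.0, 5.0, 3.5],
    "monthly_rent": [900, 1500, 700],
})

# Derived feature: floor area, often a better predictor of rent than raw dimensions.
listings["room_area"] = listings["room_width"] * listings["room_length"]
print(listings)
```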

Proper preparation from kick-off will ensure that your data science project gets off on the right foot, with the right goals in mind. An initial data assessment can outline how to prepare your data for further modelling.

6. Modelling

We advise that you start from proof of concept (PoC) development, where you can validate initial ideas before your team starts pre-testing on your real-world data. After you've validated your ideas with a PoC, you can safely proceed to production model creation.

Define the modelling technique

Even though you may have chosen a tool at the business understanding stage, the modelling stage begins with choosing the specific modelling technique you'll use. At this stage, you generate a number of models that are set up, built and trained. ML models (linear regression, KNN, ensembles, random forest, etc.) and deep learning models (RNN, LSTM and GANs) are part of this step.

Come up with a test design

Before model creation, the testing method or system should be developed to review the model's quality and validity. Let's take classification as a data mining task: error rates can be used as quality measures. Separate the dataset into training, validation and test sets, build the model using the training set and assess its quality on the separate test set (the validation set is used for model/approach selection, not for the final error/accuracy measurement), as in the sketch below.
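A small sketch of such a test design with scikit-learn is shown below; the synthetic data, split proportions and variable names are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the project's real dataset.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# 70% train, 15% validation (model/approach selection), 15% final test.
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp
)
print(len(X_train), len(X_val), len(X_test))
```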

Build a model

To develop one or more models, run the modelling tool on the prepared dataset.

7. Model evaluation


Model selection during the prototyping phase

To assess the model, leverage your domain knowledge, criteria of data mining success and desired test design. After evaluating the success of the modelling application, work together with business analysts and domain experts to review the data mining results in the business context.

Include business objectives and business success criteria at this point. Usually, data mining projects implement a technique several times, and data mining results are obtained by many different methods.

Here are some methods used by data scientists to check a model's accuracy:


The confusion matrix
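As a hedged illustration, a confusion matrix can be computed with scikit-learn as follows; the synthetic dataset and the logistic regression model are assumptions used only to produce something to evaluate:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are actual classes, columns are predicted classes.
print(confusion_matrix(y_test, y_pred))
print("accuracy:", accuracy_score(y_test, y_pred))
```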

The assessment method should fit your business objectives. You can also go back to preprocessing to refine your approach: apply different preprocessing techniques, extract other features and then return to the modelling stage. You can also do factor analysis to check how your model reacts to different samples.

8. Deployment: Real-world integration and model monitoring

When the model has passed the validation stage, and you and your stakeholders are 100% happy with the results, only then can you move on to full-scale development: integrating the model within your real production environment. The role of DevOps, MLOps and database engineers is very important at this stage.

The model consists of a set of scripts that process data from databases, data lakes and file systems (CSV, XLS, URLs), using APIs, ports, sockets or other sources. You'll need some technical expertise to find your way around the models.

Alternatively, you could have a custom user interface built, or have the model integrated with your existing systems for convenience and ease of use. This is easily done via microservices and other methods of integration. Once validation and deployment are complete, your data science team and business leaders need to step back and assess the project€™s overall success.

9. Data model monitoring and maintenance

A data science project doesn't end with the deployment stage; the maintenance step comes next. Data changes from day to day, so a monitoring system is needed to track the model's performance over time.

When the model's performance degrades, monitoring systems can indicate whether a failure needs to be handled, whether the model should be retrained, or whether a new model should be implemented. The main purpose of maintenance is to ensure a system's full functionality and optimal performance until the end of its working life.

10. Data model disposition

Data disposition is the last stage in the data science project life cycle, consisting of either data or model reuse/repurpose or data/model destruction. Once the data gets reused or repurposed, your data science project life cycle becomes circular. Data reuse means using the same information several times for the same purpose, while data repurpose means using the same data to serve more than one purpose.

Data or model destruction, on the other hand, means complete information removal. To erase the information, you can, among other things, overwrite it or physically destroy the carrier. Data destruction is critical to protect privacy, and failure to delete information may lead to breaches, compliance problems and other issues.

AI will keep shaping the establishment of new business, financial and operating models in 2021 and beyond. The investments of world-leading companies will affect the global economy and its workforce and are likely to define new winners and losers.

The lack of AI-specific skills remains a primary obstacle to adoption in the majority of organisations. In the O'Reilly survey, around 58% of respondents mentioned the shortage of ML modellers and data scientists, among other skill gaps within their organisations.


Source: AI adoption in the enterprise 2020

Do you have questions about how your data can be used to help you boost your business performance? We will be happy to answer them. Drop us a line.

Originally published at ELEKS Labs blog.


What Is Data Mining?

Data mining is a computer-assisted technique used in analytics to process and explore large data sets. With data mining tools and methods, organizations can discover hidden patterns and relationships in their data. Data mining transforms raw data into practical knowledge. Companies use this knowledge to solve problems, analyze the future impact of business decisions, and increase their profit margins.

What does the term data mining mean?

“Data mining” is a misnomer because the goal of data mining is not to extract or mine the data itself. Instead, a large amount of data is already present, and data mining extracts meaning or valuable knowledge from it. The typical process of data collection, storage, analysis, and mining is outlined below.

Why is data mining important?

Data mining is a crucial part of any successful analytics initiative. Businesses can use the knowledge discovery process to increase customer trust, find new sources of revenue, and keep customers coming back. Effective data mining aids in various aspects of business planning and operations management. Below are some examples of how different industries use data mining.

Telecom, media, and technology

High-competition verticals like telecom, media, and technology use data mining to improve customer service by finding patterns in customer behavior. For example, a company could analyze bandwidth usage patterns and provide customized service upgrades or recommendations.

Banking and insurance

Financial services can use data mining applications to solve complex fraud, compliance, risk management, and customer attrition problems. For example, insurance companies can discover optimal product pricing by comparing past product performance with competitor pricing.

Education

Education providers can use data mining algorithms to test students, customize lessons, and gamify learning. Unified, data-driven views of student progress can help educators see what students need and support them better.

Manufacturing

Manufacturing services can use data mining techniques to provide real-time and predictive analytics for overall equipment effectiveness, service levels, product quality, and supply chain efficiency. For example, manufacturers can use historical data to predict the wear of production machinery and anticipate maintenance. As a result, they can optimize production schedules and reduce downtime.

Retail

Retail companies have large customer databases with raw data about customer purchase behavior. Data mining can process this data to derive relevant insights for marketing campaigns and sales forecasts. Through more accurate data models, retail companies can optimize sales and logistics for increased customer satisfaction. For example, data mining can reveal popular seasonal products that can be stocked in advance to avoid last-minute shortages.

How does data mining work?

The Cross-Industry Standard Process for Data Mining (CRISP-DM) is an excellent guideline for starting the data mining process. CRISP-DM is both a methodology and a process model that is industry, tool, and application neutral.

What are the six phases of the data mining process?

Using the flexible CRISP-DM phases, data teams can move back and forth between stages as needed. Software tools can also perform or support some of these tasks.

1. Business understanding

The data scientist or data miner starts by identifying project objectives and scope. They collaborate with business stakeholders to identify the information needed to address the business problem.

They then use this information to define data mining goals and identify the resources required for knowledge discovery.

2. Data understanding

Once they understand the business problem, data scientists begin preliminary analysis of the data. They gather data sets from various sources, obtain access rights, and prepare a data description report. The report includes the data types, quantity, and hardware and software requirements for data processing. Once the business has approved their plan, they begin exploring and verifying the data. They manipulate the data using basic statistical techniques, assess the data quality, and choose a final data set for the next stage.

3. Data preparation

Data miners spend the most time on this phase because data mining software requires high-quality data. Business processes collect and store data for reasons other than mining, and data miners must refine it before using it for modeling. Data preparation involves the following processes.

Clean the data 

For example, handle missing data, data errors, default values, and data corrections.

Integrate the data

For example, combine two disparate data sets to get the final target data set.

Format the data

For example, convert data types or configure data for the specific mining technology being used.
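As an illustrative sketch of these three preparation activities in pandas (the file names `crm_customers.csv` and `erp_orders.csv`, and all column names, are hypothetical):

```python
import pandas as pd

customers = pd.read_csv("crm_customers.csv")
orders = pd.read_csv("erp_orders.csv")

# Clean: handle missing values and obvious errors.
customers["age"] = customers["age"].fillna(customers["age"].median())
customers = customers[customers["age"].between(18, 100)]

# Integrate: combine two disparate data sets into one target data set.
target = orders.merge(customers, on="customer_id", how="inner")

# Format: convert types to what the mining tool expects.
target["order_date"] = pd.to_datetime(target["order_date"])
target["segment"] = target["segment"].astype("category")
```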

4. Data modeling

Data miners input the prepared data into the data mining software and study the results. To do this, they can choose from multiple data mining techniques and tools. They must also write tests to assess the quality of data mining results. To model the data, data scientists can:

5. Evaluation

After creating the models, data miners start measuring them against the original business goals. They share the results with business analysts and collect feedback. The model might answer the original question well or show new and previously unknown patterns. Data miners can change the model, adjust the business goal, or revisit the data, depending on the business feedback. Continual evaluation, feedback, and modification are part of the knowledge discovery process.

6. Deployment

During deployment, other stakeholders use the working model to generate business intelligence. The data scientist plans the deployment process, which includes teaching others about the model functions, continually monitoring, and maintaining the data mining application. Business analysts use the application to create reports for management, share results with customers, and improve business processes.

What are the techniques for data mining?

Data mining techniques draw from various fields of learning that overlap, including statistical analysis, machine learning (ML), and mathematics. Some examples are given below.

Association rule mining

Association rule mining is the process of finding relationships between two different, seemingly unrelated data sets. If-then statements demonstrate the probability of a relationship between two data points. Data scientists measure result accuracy using support and confidence criteria. Support measures how frequently the related elements appear in the data set, while confidence shows the number of times an if-then statement is accurate.

For example, when customers buy an item, they also often buy a second related item. Retailers can use association mining on past purchase data to identify a new customer's interest. They use data mining results to populate the recommended sections of online stores.
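A minimal sketch of how support and confidence can be computed for one if-then rule, using a tiny made-up set of transactions:

```python
# Made-up transaction data: each set holds the items in one purchase.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]

def rule_metrics(antecedent, consequent, baskets):
    """Support and confidence for the rule: if antecedent then consequent."""
    both = sum(1 for b in baskets if antecedent <= b and consequent <= b)
    ante = sum(1 for b in baskets if antecedent <= b)
    support = both / len(baskets)
    confidence = both / ante if ante else 0.0
    return support, confidence

support, confidence = rule_metrics({"bread"}, {"butter"}, transactions)
print(f"support={support:.2f}, confidence={confidence:.2f}")
```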

Classification

Classification is a complex data mining technique that trains the ML algorithm to sort data into distinct categories. It uses statistical methods like decision trees and nearest-neighbor to identify the category. In all these methods, the algorithm is preprogrammed with known data classifications to guess the type of a new data element.

For example, analysts can train the data mining software by using labeled images of apples and mangoes. With some accuracy, the software can then predict if a new picture is an apple, mango, or other fruit.
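A small classification sketch with scikit-learn is shown below; it uses the bundled Iris dataset and a decision tree purely as stand-ins for the labeled-image example above:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# A small labeled dataset stands in for the labeled apple/mango images.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

clf = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```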

Clustering

Clustering is grouping multiple data points together based on their similarities. It is different from classification because it cannot distinguish the data by specific category but can find patterns in their similarities. The data mining result is a set of clusters where each collection is distinct from other groups, but the objects in each cluster are similar in some way.

For example, cluster analysis can help with market research when working with multivariate data from surveys. Market researchers use cluster analysis to divide consumers into market segments and better understand the relationships between different groups.
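A brief clustering sketch with scikit-learn, using synthetic data in place of real survey responses:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic survey-like data with three latent segments.
X, _ = make_blobs(n_samples=300, centers=3, n_features=4, random_state=7)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=7).fit(X)
print("cluster sizes:", [int((kmeans.labels_ == k).sum()) for k in range(3)])
```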

Sequence and path analysis

Data mining software can also look for patterns in which a particular set of events or values leads to later ones. It can recognize some variation in data that happens at regular intervals or in the ebb and flow of data points over time.

For example, a business might use path analysis to discover that certain product sales spike just before the holidays or to notice that warmer weather brings more people to its website.

What are the types of data mining?

Depending on the data and the purpose of mining, data mining can have various branches or specializations. Let's look at some of them below.

Process Mining

Process mining is a branch of data mining that aims to discover, monitor, and improve business processes. It extracts knowledge from event logs that are available in information systems. It helps organizations see and understand what's happening in these processes from day to day.

For example, e-commerce businesses have many processes, like procurement, sales, payments, collection, and shipping. By mining their procurement data logs, they might see that their supplier delivery reliability is 54% or that 12% of suppliers are consistently delivering early. They can use this information to optimize their supplier relationships.

Text mining

Text mining or text data mining is using data mining software to read and comprehend text. Data scientists use text mining to automate knowledge discovery in written resources like websites, books, emails, reviews, and articles.

For example, a digital media company could use text mining to automatically read comments on its online videos and classify audience reviews as positive or negative.
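A toy text-mining sketch with scikit-learn; the four hand-labeled comments are invented, and a real project would need far more training data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny hand-labeled comment set; a real project would use thousands of examples.
comments = [
    "Loved this video, great explanation",
    "Awful audio, could not finish watching",
    "Really helpful and clear",
    "Terrible, waste of time",
]
labels = ["positive", "negative", "positive", "negative"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(comments, labels)
print(model.predict(["clear and helpful video"]))
```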

Predictive Mining

Predictive data mining uses business intelligence to predict trends. It helps business leaders study the impact of their decisions on the company’s future and make effective choices.

For example, a company might look at past product returns data to design a warranty scheme that does not lead to losses. Using predictive mining, they will predict the potential number of returns in the coming year and create a one-year warranty plan that considers the loss when determining the product price.

How can AWS help with data mining?

Amazon SageMaker is a leading data mining software platform. It helps data miners and developers prepare, build, train, and deploy high-quality machine learning (ML) models. It includes several tools for the data mining process.



Data Mining - (Life cycle|Project|Data Pipeline)


Data mining is an experimental science.

Data mining reveals correlation, not causation.

From data to information (patterns, or expectations, that underlie them).

Any data scientist worth their salary will say you should start with a question, NOT the data, @JakePorway
Most #bigdata problems can be addressed by proper sampling/filtering and running models on a single (perhaps large) machine … Chris Volinsky


The only way to find out what will happen when a complex system is disturbed is to disturb the system, not merely to observe it passively Fred Mosteller and John Tukey, paraphrasing George Box

In other words, if you want to make a causal statement about a predictor for an outcome, you actually have to be able to take the system and perturb that particular predictor keeping the other ones fixed.

That will allow you to make a causal statement about a predictor variable and its effect on the outcome. It's not good enough simply to collect observations from the system; data from passive observation alone cannot establish causality.

So, in order to know what happens when a complex system is perturbed, it must be perturbed, not merely observed.


Learning is iterative:

The phases of solving a business problem using Data Mining are as follows:

For a Supervised problem:

Cross Industry Standard Process Model for Data Mining

The Cross Industry Standard Process Model for Data Mining (CRISP-DM). From: An Oracle White Paper - February 2013 - Information Management and Big Data A Reference Architecture

Uber's Michelangelo platform (https://eng.uber.com/michelangelo/) describes 6 steps, including:

* Evaluate models

When Google rolled out flu stories in Google News, people started reading about flu in the news and searching on those stories, and that skewed the results. During the period from 2011 to 2013, Google Flu Trends overestimated the prevalence of flu (by a factor of two in 2012 and 2013). They needed to take this new factor into account.

Google Flu Trends teaches us that the modelling process cannot be static; rather, we must periodically revisit the process and understand what underlying factors, if any, may have changed.


Introduction to Life Cycle of Data Science projects (Beginner Friendly)

This article was published as a part of the Data Science Blogathon.

As a data scientist aspirant, you must be keen to understand how the life cycle of data science projects works so that it’s easier for you to implement your individual projects in a similar pattern. Today, we will be basically discussing the step-by-step implementation process of any data science project in a real-world scenario.

What is a Data Science Project Lifecycle?

In simple terms, a data science life cycle is a repetitive set of steps that you need to take to complete and deliver a project or product to your client. Because the projects and the teams involved in developing and deploying the models differ, the life cycle will be slightly different in every company. However, most data science projects follow a broadly similar process.

In order to start and complete a data science project, we need to understand the various roles and responsibilities of the people involved in building and developing it. Let us take a look at the people who are typically involved in a data science project:

Who Is Involved in the Projects:

Now that we have an idea of who is involved in a typical business project, let's understand what a data science project is and how we define its life cycle in a real-world scenario like a fake news identifier.

Why do we need to define the Life Cycle of a data science project?


In a normal case, a data science project contains data as its main element. Without any data, we won't be able to do any analysis or predict any outcome, as we are looking at something unknown. Hence, before starting any data science project that we have received from our clients or stakeholders, we first need to understand the underlying problem statement presented by them. Once we understand the business problem, we have to gather the relevant data that will help us solve the use case. However, for beginners, many questions arise:

In what format do we need the data?

How to get the data?

What do we need to do with data?

So many questions, and the answers might vary from person to person. Hence, in order to address all these concerns, we have a pre-defined flow that is termed the Data Science Project Life Cycle. The process is fairly simple: the company first gathers data, performs data cleaning, performs EDA to extract relevant features, and prepares the data through feature engineering and feature scaling. In the second phase, the model is built and deployed after a proper evaluation. This entire life cycle is not a one-person job; you need the entire team to work together to achieve the required level of efficiency for the project.

The globally accepted structure for resolving any sort of analytical problem is popularly known as the Cross Industry Standard Process for Data Mining, abbreviated as the CRISP-DM framework.

Life Cycle of a Typical Data Science Project Explained:


1) Understanding the Business Problem:

In order to build a successful business model, it's very important to first understand the business problem that the client is facing. Suppose he wants to predict the customer churn rate of his retail business. You may first want to understand his business, his requirements and what he actually wants to achieve from the prediction. In such cases, it is important to consult domain experts and understand the underlying problems present in the system. A business analyst is generally responsible for gathering the required details from the client and forwarding them to the data scientist team for further analysis. Even a minute error in defining the problem or understanding the requirement can be very costly for the project, hence it must be done with maximum precision.

After asking the required questions to the company stakeholders or clients, we move to the next process which is known as data collection.

2) Data Collection

After gaining clarity on the problem statement, we need to collect relevant data to break the problem into small components.

The data science project starts with the identification of various data sources, which may include web server logs, social media posts, data from digital libraries such as the US Census datasets, data accessed through sources on the internet via APIs, web scraping, or information that is already present in an excel spreadsheet. Data collection entails obtaining information from both known internal and external sources that can assist in addressing the business issue.

Normally, the data analyst team is responsible for gathering the data. They need to figure out proper ways to source data and collect the same to get the desired results.

There are two ways to source the data:


3) Data Preparation

After gathering the data from relevant sources we need to move forward to data preparation. This stage helps us gain a better understanding of the data and prepares it for further evaluation.

Additionally, this stage is referred to as Data Cleaning or Data Wrangling. It entails steps such as selecting relevant data, combining it by mixing data sets, cleaning it, dealing with missing values by either removing them or imputing them with relevant data, dealing with incorrect data by removing it, and also checking for and dealing with outliers. By using feature engineering, you can create new data and extract new features from existing ones. Format the data according to the desired structure and delete any unnecessary columns or functions. Data preparation is the most time-consuming process, accounting for up to 90% of the total project duration, and this is the most crucial step throughout the entire life cycle.

Exploratory Data Analysis (EDA) is critical at this point because summarising clean data enables the identification of the data’s structure, outliers, anomalies, and trends. These insights can aid in identifying the optimal set of features, an algorithm to use for model creation, and model construction.
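As a hedged illustration, a first EDA pass in pandas might look like the sketch below; the file name `customer_churn.csv` is a hypothetical stand-in for the project's dataset:

```python
import pandas as pd

df = pd.read_csv("customer_churn.csv")  # hypothetical project dataset

# Quick structural overview, missing-value counts, and summary statistics.
print(df.shape)
print(df.dtypes)
print(df.isnull().sum())
print(df.describe())

# Correlations between numeric features can hint at useful predictors.
print(df.corr(numeric_only=True))
```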

4) Data Modeling

In most data analysis cases, data modeling is regarded as the core process. In data modeling, we take the prepared data as input and try to produce the desired output.

We first select the appropriate type of model to implement, depending on whether the problem is a regression, classification or clustering problem. Based on the type of data received, we choose the machine learning algorithm best suited for the model. Once this is done, we tune the hyperparameters of the chosen models to get a favorable outcome.

Finally, we evaluate the model by testing its accuracy and relevance. We also need to ensure a correct balance between specificity and generalizability: the created model must be unbiased.
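A short sketch of model selection and hyperparameter tuning with scikit-learn is shown below; the random forest, the parameter grid and the synthetic data are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=1000, n_features=12, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Search over a small grid of hyperparameters using cross-validation.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [4, 8, None]},
    cv=5,
)
grid.fit(X_train, y_train)
print("best params:", grid.best_params_)
print("held-out accuracy:", grid.score(X_test, y_test))
```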


5) Model Deployment

Before the model is deployed, we need to ensure that we have picked the right solution after a rigorous evaluation. It is then deployed in the desired channel and format. This is naturally the last step in the life cycle of data science projects. Take extra caution when executing each step in the life cycle to avoid unwanted errors. For example, if you choose the wrong machine learning algorithm for data modeling, you will not achieve the desired accuracy and it will be difficult to get approval for the project from the stakeholders. If your data is not cleaned properly, you will have to handle missing values or noise in the dataset later on. Hence, to make sure the model is deployed properly and accepted in the real world as an optimal use case, you will have to do rigorous testing at every step.

All the steps mentioned above are equally applicable for beginners as well as seasoned data science practitioners. As a beginner, your job is to learn the process first; then you need to practice and deploy smaller projects like a fake news detector or the Titanic dataset. You can refer to portals like analyticsvidhya.com, kaggle.com, and hackerearth.com to get datasets and start working on them.

Luckily for beginners, these portals have already cleaned most of the data, and hence proceeding with the next steps will be fairly easy. However, in the real world, you have to acquire not just any data set but the data that might meet the requirements of your data science project. Hence, initially, your job is to first proceed with all the steps of the data science life cycle very sincerely, and once you are thorough with the process and deployment you are ready to take the next step towards a career in this field. Python and R are the two languages that are most widely used in data science use cases.

Nowadays, even Julia is becoming one of the preferred languages for deploying the model. However, along with the clarity in the process, you should be comfortable in coding via such languages. From process understanding to proficiency in the programming language, you need to be adept with all.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.


About the Author

Ananya Chakraborty



Data Mining Project Cycle - Data Mining


What is the life cycle of a data mining project? What are the challenging steps? Who should be involved in a data mining project? To answer these questions, let’s go over a typical data mining project step by step.

Step 1: Data Collection

The first step of data mining is usually data collection. Business data is stored in many systems across an enterprise. For example, there are hundreds of OLTP databases and over 70 data warehouses inside Microsoft. The first step is to pull the relevant data to a database or a data mart where the data analysis is applied. For instance, if you want to analyze the Web click stream and your company has a dozen Web servers, the first step is to download the Web log data from each Web server.

Sometimes you might be lucky. The data warehouse on the subject of your analysis already exists. However, the data in the data warehouse may not be rich enough. You may still need to gather data from other sources. Suppose that there is a click stream data warehouse containing all the Web clicks on the Web site of your company. You have basic information about customers’ navigation patterns. However, because there is not much demographic information about your Web visitors, you may need to purchase or gather some demographic data from other sources in order to build a more accurate model.

After the data is collected, you can sample the data to reduce the volume of the training dataset. In many cases, the patterns contained in 50,000 customers are the same as in 1 million customers.

Step 2: Data Cleaning and Transformation

Data cleaning and transformation is the most resource-intensive step in a data mining project. The purpose of data cleaning is to remove noise and irrelevant information out of the dataset. The purpose of data transformation is to modify the source data into different formats in terms of data types and values. There are various techniques you can apply to data cleaning and transformation, including:

Data type transform: This is the simplest data transform. An example is transforming a Boolean column type to integer. The reason for this transform is that some data mining algorithms perform better on integer data, while others prefer Boolean data.

Continuous column transform: For continuous data such as that in Income and Age columns, a typical transform is to bin the data into buckets. For example, you may want to bin Age into five predefined age groups. Apart from binning, techniques such as normalization are popular for transforming continuous data. Normalization maps all numerical values to a number between 0 and 1 (or –1 to 1) to ensure that large numbers do not dominate smaller numbers during the analysis.
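A minimal pandas sketch of binning and min-max normalization, with made-up Age and Income values:

```python
import pandas as pd

ages = pd.Series([19, 24, 37, 45, 52, 61, 73])
incomes = pd.Series([18_000, 32_000, 54_000, 75_000, 120_000])

# Binning: map Age into five predefined age groups.
age_groups = pd.cut(ages, bins=[0, 25, 35, 50, 65, 120],
                    labels=["<25", "25-34", "35-49", "50-64", "65+"])

# Normalization: map Income to the 0-1 range so large numbers do not dominate.
income_scaled = (incomes - incomes.min()) / (incomes.max() - incomes.min())
print(age_groups.tolist())
print(income_scaled.round(2).tolist())
```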

Grouping: Sometimes there are too many distinct values (states) for a discrete column. You need to group these values into a few groups to reduce the model’s complexity. For example, the column Profession may have tens of different values such as Software Engineer, Telecom Engineer, Mechanical Engineer, Consultant, and so on. You can group the various engineering professions under a single value: Engineer. Grouping also makes the model easier to interpret.

Aggregation: Aggregation is yet another important transform. Suppose that there is a table containing the telephone call detail records (CDR) for each customer, and your goal is to segment customers based on their monthly phone usage. Since the CDR information is too detailed for the model, you need to aggregate all the calls into a few derived attributes such as total number of calls and the average call duration. These derived attributes can later be used in the model.
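A small sketch of that aggregation with pandas, using a few invented call records:

```python
import pandas as pd

# Hypothetical call detail records (CDR): one row per call.
cdr = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "duration_min": [3.5, 12.0, 1.2, 45.0, 30.5],
})

# Aggregate per customer into derived attributes for the segmentation model.
usage = cdr.groupby("customer_id").agg(
    total_calls=("duration_min", "count"),
    avg_duration=("duration_min", "mean"),
)
print(usage)
```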

Missing value handling: Most datasets contain missing values. There are a number of causes for missing data. For instance, you may have two customer tables coming from two OLTP databases. Merging these tables can result in missing values, since table definitions are not exactly the same. In another example, your customer demographic table may have a column for age. But customers don’t always like to give you this information during the registration. You may have a table of daily closing values for the stock MSFT. Because the stock market closes on weekends, there will be null values for those dates in the table. Addressing missing values is an important issue. There are a few ways to deal with this problem. You may replace the missing values with the most popular value (constant). If you don’t know a customer’s age, you can replace it with the average age of all the customers. When a record has too many missing values, you may simply remove it. For more advanced cases, you can build a mining model using those complete cases, and then apply the model to predict the most likely value for each missing case.
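A brief pandas sketch of the simpler missing-value strategies described above, on invented customer data:

```python
import pandas as pd

customers = pd.DataFrame({
    "age": [34, None, 52, 29, None],
    "state": ["WA", "CA", None, "WA", "NY"],
})

# Replace a missing numeric value with the average of the known values.
customers["age"] = customers["age"].fillna(customers["age"].mean())

# Replace a missing categorical value with the most popular value (the mode).
customers["state"] = customers["state"].fillna(customers["state"].mode()[0])

# Alternatively, drop records that have too many missing values.
customers = customers.dropna(thresh=2)
print(customers)
```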

Removing outliers: Outliers are abnormal cases in a dataset. Abnormal cases affect the quality of a model. For example, suppose that you want to build a customer segmentation model based on customer telephone usage (average duration, total number of calls, monthly invoice, international calls, and so on). There are a few customers (0.5%) who behave very differently. Some of these customers live abroad and use roaming all the time. If you include those abnormal cases in the model, you may end up creating a model with the majority of customers in one segment and a few other very small segments containing only these outliers.

The best way to deal with outliers is to simply remove them before the analysis. You can remove outliers based on an individual attribute; for instance, removing the 0.5% of customers with the highest or lowest income. You may also remove outliers based on a set of attributes. In this case, you can use a clustering algorithm. Many clustering algorithms, including Microsoft Clustering, group outliers into a few particular clusters.
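A quick sketch of outlier removal on a single attribute; this example uses the common interquartile-range rule on made-up invoice data, while the percentile trim described above works the same way with quantile cut-offs:

```python
import pandas as pd

usage = pd.DataFrame({
    "customer_id": range(1, 9),
    "monthly_invoice": [40, 55, 35, 60, 48, 52, 950, 45],  # one abnormal case
})

# Interquartile-range rule: values far outside the middle 50% are outliers.
q1, q3 = usage["monthly_invoice"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = usage["monthly_invoice"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
print(usage[mask])  # the abnormal 950 invoice is removed before modelling
```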

There are many other data-cleaning and transformation techniques, and there are many tools available in the market. SQL Server Integration Services (SSIS) provides a set of transforms covering most of the tasks listed here.

Step 3: Model Building

Once the data is cleaned and the variables are transformed, we can start to build models. Before building any model, we need to understand the goal of the data mining project and the type of the data mining task. Is this project a classification task, an association task or a segmentation task? In this stage, we need to team up with business analysts with domain knowledge. For example, if we mine telecom data, we should team up with marketing people who understand the telecom business.

Model building is the core of data mining, though it is not as time- and resource-intensive as data transformation. Once you understand the type of data mining task, it is relatively easy to pick the right algorithms. For each data mining task, there are a few suitable algorithms. In many cases, you won’t know which algorithm is the best fit for the data before model training. The accuracy of the algorithm depends on the nature of the data such as the number of states of the predictable attribute, the value distribution of each attribute, the relationships among attributes, and so on. For example, if the relationship among all input attributes and predictable attributes were linear, the decision tree algorithm would be a very good choice. If the relationships among attributes are more complicated, then the neural network algorithm should be considered.

The correct approach is to build multiple models using different algorithms and then compare the accuracy of these models using some tool, such as a lift chart, which is described in the next step. Even for the same algorithm, you may need to build multiple models using different parameter settings in order to fine-tune the model’s accuracy.

Step 4: Model Assessment

In the model-building stage, we build a set of models using different algorithms and parameter settings. So what is the best model in terms of accuracy? How do you evaluate these models? There are a few popular tools to evaluate the quality of a model. The most well-known one is the lift chart. It uses a trained model to predict the values of the testing dataset. Based on the predicted value and probability, it graphically displays the model in a chart.

In the model assessment stage, not only do you use tools to evaluate the model accuracy but you also need to discuss the meaning of discovered patterns with business analysts. For example, if you build an association model on a dataset, you may find rules such as Relationship = Husband => Gender = Male with 100% confidence. Although the rule is valid, it doesn’t contain any business value. It is very important to work with business analysts who have the proper domain knowledge in order to validate the discoveries. Sometimes the model doesn’t contain useful patterns. This may occur for a couple of reasons. One is that the data is completely random. While it is possible to have random data, in most cases, real datasets do contain rich information. The second reason, which is more likely, is that the set of variables in the model is not the best one to use. You may need to repeat the data-cleaning and transformation step in order to derive more meaningful variables. Data mining is a cyclic process; it usually takes a few iterations to find the right model.

Step 5: Reporting

Reporting is an important delivery channel for data mining findings. In many organizations, the goal of data miners is to deliver reports to the marketing executives. Most data mining tools have reporting features that allow users to generate predefined reports from mining models with textual or graphic outputs. There are two types of reports: reports about the findings (patterns) and reports about the prediction or forecast.

Step 6: Prediction (Scoring) In many data mining projects, finding patterns is just half of the work; the final goal is to use these models for prediction. Prediction is also called scoring in data mining terminology. To give predictions, we need to have a trained model and a set of new cases. Consider a banking scenario in which you have built a model about loan risk evaluation. Every day there are thousands of new loan applications. You can use the risk evaluation model to predict the potential risk for each of these loan applications.

Step 7: Application Integration Embedding data mining into business applications is about applying intelligence back to business, that is, closing the analysis loop. According to Gartner Research, in the next few years, more and more business applications will embed a data mining component as a value-added. For example, CRM applications may have data mining features that group customers into segments. ERP applications may have data mining features to forecast production. An online bookstore can give customers real-time recommendations on books. Integrating data mining features, especially a real-time prediction component into applications is one of the important steps of data mining projects. This is the key step for bringing data mining into mass usage.

Step 8: Model Management It is challenging to maintain the status of mining models. Each mining model has a life cycle. In some businesses, patterns are relatively stable and models don’t require frequent retraining. But in many businesses patterns vary frequently. For example, in online bookstores, new books appear every day. This means that new association rules appear every day. The duration of a mining model is limited. Anew version of the model must be created frequently. Ultimately, determining the model’s accuracy and creating new versions of the model should be accomplished by using automated processes. Like any data, mining models also have security issues. Mining models contain patterns. Many of these patterns are the summary of sensitive data. We need to maintain the read, write, and prediction rights for different user profiles. Mining models should be treated as first-class citizens in a database, where administrators can assign and revoke user access rights to these models.


Data Mining Project Cycle

What is the life cycle of a data mining project? What are the challenging steps?

Who should be involved in a data mining project? To answer these questions, let’s go over a typical data mining project step by step.

Step 1: Data Collection

The first step of data mining is usually data collection. Business data is stored in many systems across an enterprise. For example, there are hundreds of OLTP databases and over 70 data warehouses inside Microsoft. The first step is to pull the relevant data to a database or a data mart where the data analysis is applied. For instance, if you want to analyze the Web click stream and your company has a dozen Web servers, the first step is to download the Web log data from each Web server.

Sometimes you might be lucky. The data warehouse on the subject of your analysis already exists. However, the data in the data warehouse may not be rich enough. You may still need to gather data from other sources. Suppose that there is a click stream data warehouse containing all the Web clicks on the Web site of your company. You have basic information about customers’ navigation patterns. However, because there is not much demographic information about your Web visitors, you may need to purchase or gather some demographic data from other sources in order to build a more accurate model.

After the data is collected, you can sample the data to reduce the volume of the training dataset. In many cases, the patterns contained in 50,000 customers are the same as in 1 million customers.
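As an illustration of this sampling idea, the minimal Python sketch below draws a 50,000-row training sample from a larger customer extract. The file name, column layout, and size threshold are illustrative assumptions, not part of any specific system.

```python
import pandas as pd

# Hypothetical example: the full customer extract is assumed to live in
# "customers.csv"; the file name and the 50,000-row threshold are illustrative.
customers = pd.read_csv("customers.csv")

# Draw a random sample for model training when the full extract is much larger;
# a fixed random_state keeps the sample reproducible across runs.
if len(customers) > 50_000:
    training_set = customers.sample(n=50_000, random_state=42)
else:
    training_set = customers

print(f"Training on {len(training_set)} of {len(customers)} rows")
```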

Step 2: Data Cleaning and Transformation

Data cleaning and transformation is the most resource-intensive step in a data mining project. The purpose of data cleaning is to remove noise and irrelevant information from the dataset. The purpose of data transformation is to modify the source data into different formats in terms of data types and values.

There are various techniques you can apply to data cleaning and transformation (a combined code sketch of several of them follows this list), including:

Data type transform: This is the simplest data transform. An example is transforming a Boolean column type to integer. The reason for this transform is that some data mining algorithms perform better on integer data, while others prefer Boolean data.

Continuous column transform: For continuous data such as that in Income and Age columns, a typical transform is to bin the data into buckets. For example, you may want to bin Age into five predefined age groups. Apart from binning, techniques such as normalization are popular for transforming continuous data. Normalization maps all numerical values to a number between 0 and 1 (or –1 to 1) to ensure that large numbers do not dominate smaller numbers during the analysis.

Grouping: Sometimes there are too many distinct values (states) for a discrete column. You need to group these values into a few groups to reduce the model’s complexity. For example, the column Profession may have tens of different values such as Software Engineer, Telecom Engineer, Mechanical Engineer, Consultant, and so on. You can group various engineering professions by using a single value: Engineer. Grouping also makes the model easier to interpret.

Aggregation: Aggregation is yet another important transform. Suppose that there is a table containing the telephone call detail records (CDR) for each customer, and your goal is to segment customers based on their monthly phone usage. Since the CDR information is too detailed for the model, you need to aggregate all the calls into a few derived attributes such as total number of calls and the average call duration. These derived attributes can later be used in the model.

Missing value handling: Most datasets contain missing values. There are a number of causes for missing data. For instance, you may have two customer tables coming from two OLTP databases. Merging these tables can result in missing values, since table definitions are not exactly the same. In another example, your customer demographic table may have a column for age. But customers don’t always like to give you this information during the registration. You may have a table of daily closing values for the stock MSFT. Because the stock market closes on weekends, there will be null values for those dates in the table. Addressing missing values is an important issue. There are a few ways to deal with this problem. You may replace the missing values with the most popular value (constant). If you don’t know a customer’s age, you can replace it with the average age of all the customers. When a record has too many missing values, you may simply remove it. For more advanced cases, you can build a mining model using those complete cases, and then apply the model to predict the most likely value for each missing case.

Removing outliers: Outliers are abnormal cases in a dataset. Abnormal cases affect the quality of a model. For example, suppose that you want to build a customer segmentation model based on customer telephone usage (average duration, total number of calls, monthly invoice, international calls, and so on). There are a few customers (0.5%) who behave very differently. Some of these customers live abroad and use roaming all the time. If you include those abnormal cases in the model, you may end up creating a model with the majority of customers in one segment and a few other very small segments containing only these outliers.

The best way to deal with outliers is to simply remove them before the analysis. You can remove outliers based on an individual attribute; for instance, removing the 0.5% of customers with the highest or lowest income. You may also remove outliers based on a set of attributes. In this case, you can use a clustering algorithm. Many clustering algorithms, including Microsoft Clustering, group outliers into a few particular clusters.

There are many other data-cleaning and transformation techniques, and there are many tools available in the market. SQL Server Integration Services (SSIS) provides a set of transforms covering most of the tasks listed here.
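The sketch below illustrates, with the pandas library, a few of the transforms described above: missing-value replacement, binning a continuous Age column, min-max normalization of Income, and grouping the engineering professions. The column names and values are invented for illustration; a real project would apply the same ideas to its own tables (or use the equivalent SSIS transforms).

```python
import pandas as pd

# Hypothetical customer table; the columns (Age, Income, Profession) and
# values are illustrative, not taken from any specific source system.
df = pd.DataFrame({
    "Age": [23, 37, None, 58, 45],
    "Income": [32000, 54000, 41000, None, 98000],
    "Profession": ["Software Engineer", "Consultant", "Telecom Engineer",
                   "Mechanical Engineer", "Teacher"],
})

# Missing value handling: replace missing Age/Income with the column mean.
df["Age"] = df["Age"].fillna(df["Age"].mean())
df["Income"] = df["Income"].fillna(df["Income"].mean())

# Continuous column transform: bin Age into five predefined age groups.
df["AgeGroup"] = pd.cut(df["Age"], bins=[0, 25, 35, 45, 55, 120],
                        labels=["<25", "25-34", "35-44", "45-54", "55+"])

# Normalization: map Income onto the 0-1 range so large values do not dominate.
df["IncomeNorm"] = (df["Income"] - df["Income"].min()) / (df["Income"].max() - df["Income"].min())

# Grouping: collapse the many engineering professions into a single state.
df["ProfessionGroup"] = df["Profession"].where(
    ~df["Profession"].str.contains("Engineer"), "Engineer")

print(df)
```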

Step 3: Model Building

Once the data is cleaned and the variables are transformed, we can start to build models. Before building any model, we need to understand the goal of the data mining project and the type of the data mining task. Is this project a classification task, an association task or a segmentation task? In this stage, we need to team up with business analysts with domain knowledge. For example, if we mine telecom data, we should team up with marketing people who understand the telecom business.

Model building is the core of data mining, though it is not as time- and resource-intensive as data transformation. Once you understand the type of data mining task, it is relatively easy to pick the right algorithms. For each data mining task, there are a few suitable algorithms. In many cases, you won’t know which algorithm is the best fit for the data before model training. The accuracy of the algorithm depends on the nature of the data such as the number of states of the predictable attribute, the value distribution of each attribute, the relationships among attributes, and so on. For example, if the relationship among all input attributes and predictable attributes were linear, the decision tree algorithm would be a very good choice. If the relationships among attributes are more complicated, then the neural network algorithm should be considered.

The correct approach is to build multiple models using different algorithms and then compare the accuracy of these models using some tool, such as a lift chart, which is described in the next step. Even for the same algorithm, you may need to build multiple models using different parameter settings in order to fine-tune the model’s accuracy.
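A minimal sketch of this multi-model approach is shown below, using scikit-learn and a synthetic dataset as a stand-in for the prepared training table. The algorithms and parameter settings are illustrative choices, not a prescription.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the cleaned and transformed training table.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Build several candidate models: different algorithms and, for the same
# algorithm, different parameter settings.
candidates = {
    "tree_depth3": DecisionTreeClassifier(max_depth=3, random_state=0),
    "tree_depth8": DecisionTreeClassifier(max_depth=8, random_state=0),
    "mlp_small":   MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0),
    "mlp_large":   MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0),
}

for name, model in candidates.items():
    model.fit(X_train, y_train)
    print(name, "test accuracy:", round(model.score(X_test, y_test), 3))
```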

Step 4: Model Assessment

In the model-building stage, we build a set of models using different algorithms and parameter settings. So what is the best model in terms of accuracy? How do you evaluate these models? There are a few popular tools to evaluate the quality of a model. The most well-known one is the lift chart. It uses a trained model to predict the values of the testing dataset. Based on the predicted value and probability, it graphically displays the model in a chart.
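The following sketch builds a simple cumulative lift chart with matplotlib, reusing the model, X_test, and y_test names from the model-building sketch above: it sorts the test cases by predicted probability and compares the cumulative share of positives captured against a random baseline. This is one common way to draw a lift chart, not the only one.

```python
import numpy as np
import matplotlib.pyplot as plt

# Assumes `model`, `X_test`, `y_test` from the model-building sketch above.
proba = model.predict_proba(X_test)[:, 1]   # predicted probability of the positive class
order = np.argsort(-proba)                  # highest-scored cases first
cum_positives = np.cumsum(y_test[order]) / y_test.sum()
population = np.arange(1, len(y_test) + 1) / len(y_test)

plt.plot(population, cum_positives, label="model")
plt.plot([0, 1], [0, 1], "--", label="random baseline")
plt.xlabel("Fraction of cases targeted (sorted by predicted probability)")
plt.ylabel("Fraction of positives captured")
plt.title("Cumulative lift chart")
plt.legend()
plt.show()
```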

In the model assessment stage, not only do you use tools to evaluate the model accuracy but you also need to discuss the meaning of discovered patterns with business analysts. For example, if you build an association model on a dataset, you may find rules such as Relationship = Husband => Gender = Male with 100% confidence. Although the rule is valid, it doesn’t contain any business value. It is very important to work with business analysts who have the proper domain knowledge in order to validate the discoveries.

Sometimes the model doesn’t contain useful patterns. This may occur for a couple of reasons. One is that the data is completely random. While it is possible to have random data, in most cases, real datasets do contain rich information. The second reason, which is more likely, is that the set of variables in the model is not the best one to use. You may need to repeat the data-cleaning and transformation step in order to derive more meaningful variables. Data mining is a cyclic process; it usually takes a few iterations to find the right model.

Step 5: Reporting

Reporting is an important delivery channel for data mining findings. In many organizations, the goal of data miners is to deliver reports to the marketing executives. Most data mining tools have reporting features that allow users to generate predefined reports from mining models with textual or graphic outputs. There are two types of reports: reports about the findings (patterns) and reports about the prediction or forecast.

Step 6: Prediction (Scoring)

In many data mining projects, finding patterns is just half of the work; the final goal is to use these models for prediction. Prediction is also called scoring in data mining terminology. To give predictions, we need to have a trained model and a set of new cases. Consider a banking scenario in which you have built a model about loan risk evaluation. Every day there are thousands of new loan applications. You can use the risk evaluation model to predict the potential risk for each of these loan applications.
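A small sketch of such a scoring run is shown below; the risk_model object, the CSV path, the feature column names, and the 0.8 review threshold are all hypothetical placeholders.

```python
import pandas as pd

RISK_THRESHOLD = 0.8  # illustrative cut-off for manual review

def score_new_applications(risk_model, csv_path, feature_cols):
    """Apply a trained risk model to a day's new loan applications.

    `risk_model` is any classifier exposing predict_proba (e.g. from scikit-learn);
    the CSV path and feature column names are hypothetical placeholders.
    """
    new_apps = pd.read_csv(csv_path)
    new_apps["predicted_risk"] = risk_model.predict_proba(new_apps[feature_cols])[:, 1]
    return new_apps[new_apps["predicted_risk"] > RISK_THRESHOLD]

# Example call (names are illustrative):
# flagged = score_new_applications(risk_model, "new_applications.csv",
#                                  ["income", "loan_amount", "age", "existing_debt"])
```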

Step 7: Application Integration

Embedding data mining into business applications is about applying intelligence back to the business, that is, closing the analysis loop. According to Gartner Research, in the next few years, more and more business applications will embed a data mining component as a value-added feature. For example, CRM applications may have data mining features that group customers into segments. ERP applications may have data mining features to forecast production. An online bookstore can give customers real-time recommendations on books.

Integrating data mining features, especially a real-time prediction component, into applications is one of the important steps of data mining projects. This is the key step for bringing data mining into mass usage.
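As a rough illustration of embedding real-time prediction into an application, the following Flask sketch exposes a recommendation endpoint for the online-bookstore scenario. The route, the payload shape, and the stub recommend function are assumptions standing in for whatever the deployed mining model actually provides.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Stand-in recommender: in a real deployment this would call the trained
# mining model; here it returns a fixed list so the sketch runs end to end.
def recommend(book_ids, n=5):
    return ["book-101", "book-202", "book-303"][:n]

@app.route("/recommendations", methods=["POST"])
def recommendations():
    # Expect a JSON body like {"book_ids": ["book-042"]} from the storefront.
    basket = request.get_json(force=True).get("book_ids", [])
    return jsonify({"based_on": basket, "recommended_books": recommend(basket)})

if __name__ == "__main__":
    app.run(port=8080)
```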

Step 8: Model Management

It is challenging to maintain the status of mining models. Each mining model has a life cycle. In some businesses, patterns are relatively stable and models don’t require frequent retraining. But in many businesses patterns vary frequently. For example, in online bookstores, new books appear every day. This means that new association rules appear every day. The duration of a mining model is limited. A new version of the model must be created frequently. Ultimately, determining the model’s accuracy and creating new versions of the model should be accomplished by using automated processes.
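One possible shape for such an automated check is sketched below; the current_model object, the train_model helper, the accuracy floor, and the versioning scheme are all hypothetical placeholders.

```python
from datetime import date
from sklearn.metrics import accuracy_score

# Illustrative accuracy floor below which the model is considered stale.
ACCURACY_FLOOR = 0.75

def refresh_model_if_stale(current_model, recent_X, recent_y, train_model):
    """Re-train and version the model when accuracy on recent data drops.

    `current_model`, the labelled recent data, and the `train_model` helper
    are placeholders for whatever the production pipeline provides.
    """
    accuracy = accuracy_score(recent_y, current_model.predict(recent_X))
    if accuracy < ACCURACY_FLOOR:
        new_model = train_model(recent_X, recent_y)
        version_tag = f"model_{date.today().isoformat()}"
        return new_model, version_tag
    return current_model, None
```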

Like any data, mining models also have security issues. Mining models contain patterns. Many of these patterns are the summary of sensitive data. We need to maintain the read, write, and prediction rights for different user profiles. Mining models should be treated as first-class citizens in a database, where administrators can assign and revoke user access rights to these models.


Traditional Data Mining Life Cycle (Crisp Methodology)

Prerequisite: Data Mining. The data life cycle is the sequence of stages that a particular unit of data goes through, from its initial generation or capture to its eventual archival and/or deletion at the end of its useful life. This cycle has surface-level similarities to the more traditional data mining cycle as described in the CRISP-DM methodology.


The Team Data Science Process lifecycle

The Team Data Science Process (TDSP) provides a recommended lifecycle that you can use to structure your data-science projects. The lifecycle outlines the complete steps that successful projects follow. If you use another data-science lifecycle, such as the Cross Industry Standard Process for Data Mining (CRISP-DM), Knowledge Discovery in Databases (KDD), or your organization's own custom process, you can still use the task-based TDSP.

This lifecycle is designed for data-science projects that are intended to ship as part of intelligent applications. These applications deploy machine learning or artificial intelligence models for predictive analytics. Exploratory data-science projects and improvised analytics projects can also benefit from the use of this process. But for those projects, some of the steps described here might not be needed.

Five lifecycle stages

The TDSP lifecycle is composed of five major stages that are executed iteratively:

Business understanding

Data acquisition and understanding

Modeling

Deployment

Customer acceptance

[Figure: the TDSP lifecycle diagram]

The TDSP lifecycle is modeled as a sequence of iterated steps that provide guidance on the tasks needed to use predictive models. You deploy the predictive models in the production environment that you plan to use to build the intelligent applications. The goal of this process lifecycle is to continue to move a data-science project toward a clear engagement end point. Data science is an exercise in research and discovery. The ability to communicate tasks to your team and your customers by using a well-defined set of artifacts that employ standardized templates helps to avoid misunderstandings. Using these templates also increases the chance of the successful completion of a complex data-science project.

For each stage, the TDSP documentation describes the goals, the tasks to carry out, and the artifacts to produce.

For examples of how to execute steps in TDSPs that use Azure Machine Learning, see Use the TDSP with Azure Machine Learning.



Towards Data Science

Israel Rodrigues

Feb 17, 2020


CRISP-DM: the leading methodology in data mining and big data

A short step-by-step guide to the machine learning methodology.

In March 2015, I collaborated on a paper called “Methodological Business Proposals for the Development of Big Data Projects” [2], together with Alberto Cavadia and Juan Gómez. Back then, we realized that big data projects usually have seven parts.

Shortly after, I used the CRISP-DM methodology for my thesis because it was an open standard, widely used in the market [3], and (thanks to the previous paper) I knew it was quite similar to other approaches.

As my career in the data layer has developed, I can’t help noticing that the CRISP-DM methodology is still quite relevant. In fact, data management units and IT roles are built around the steps of this methodology. So I decided to dedicate a short story to describing the steps of this long-standing methodology.

CRISP-DM stands for Cross Industry Standard Process for Data Mining and is a methodology created in 1996 to shape data mining projects. It consists of six steps for carrying out a data mining project, and they can be iterated in cycles according to the developers’ needs. Those steps are Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment.

The first step is Business Understanding and its objective is to give context to the goals and to the data so that the developer/engineer gets a notion of the relevance of data in that particular business model.

It is composed of meetings, online calls, documentation reading, learning about the specific field, and a long list of other activities that help the development team ask questions about the relevant context.

The product of this step is that the development team understands the context of the project. The goals of the project should be defined before the project starts. For example, the development team should know by now that the objective is to increase sales and, after this step is over, understand what the client sells and how they sell it.

The second step is Data Understanding and its objective is to know what can be expected and achieved from the data. It checks the quality of the data in several respects, such as data completeness, value distributions, and data governance compliance.

This is a crucial part of the project because it defines how viable and trustworthy the final results can be. In this step, team members brainstorm on how to extract the best value from the pieces of information. If the use or relevance of some piece of data is unclear to the development team, they can momentarily step back to understand the business and how it benefits from that piece of information.

Thanks to this step, the data scientists now know how, in terms of data, the result should satisfy the goals of the project, which algorithm and process will bring that result, what the current state of the data is, and what state it should be in to be useful to the algorithm and process involved.

The third step is Data Preparation and it involves the ETL or ELT processes that turn the pieces of data into something usable by the algorithms and processes.

Sometimes data governance policies are not respected, or not even set, in an organization, and in order to give true meaning to the data, it becomes the data engineers’ and data scientists’ job to standardize the information.

Likewise, some algorithms perform better under certain conditions: some don’t accept non-numerical values, others don’t work well with a large variance in values. Then again, it is up to the development team to normalize the information.

Most projects spend the majority of their time on this step. This step, I believe, is the reason there’s an IT profile called data engineer. Because it is time-consuming, and can get really complex when working with large amounts of data, IT departments find an advantage in dedicating resources specifically to perform these duties.

The fourth step is Modeling and it is the core of any machine learning project. This step is responsible for the results that should satisfy, or help satisfy, the project goals.

Although it is the glamorous part of the project, it is also the shortest in time: if everything before it is done correctly, there is little to adjust. If the results can be improved, the methodology is set up to step back to data preparation and improve the available data.

Algorithms such as k-means, hierarchical clustering, time series, linear regression, k-nearest neighbors, and many others form the core code of this step in the methodology.

The fifth step is Evaluation, where the task is to verify that the results are valid and correct. If the results are wrong, the methodology permits going back as far as the first step, in order to understand why the results are mistaken.

Usually, on a data science project, the data scientist divides the data into training and testing sets. In this step the testing data is used; its objective is to verify that the model (the product of the modeling step) is accurate with respect to reality.

Depending on the task and the context, there are diverse techniques. For example, in the context of supervised learning, with the task of classifying items, one way to verify the results is with a confusion matrix. For unsupervised learning, evaluation becomes harder, as there is no fixed value to separate “correct” from “incorrect”; for example, the task of grouping items would be evaluated by calculating the inter- and intra-cluster distances between elements in the clusters.
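The sketch below shows both cases with scikit-learn on synthetic data: a confusion matrix for a supervised classifier, and a silhouette score (one measure built from intra- and inter-cluster distances) for an unsupervised k-means clustering. The dataset, classifier, and number of clusters are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, silhouette_score
from sklearn.cluster import KMeans

# Supervised case: a confusion matrix compares true test labels to predictions.
X, y = make_classification(n_samples=500, n_features=6, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
y_pred = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict(X_test)
print("confusion matrix:\n", confusion_matrix(y_test, y_pred))

# Unsupervised case: with no ground truth, score cluster quality from
# intra- vs inter-cluster distances (the silhouette score is one such measure).
labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)
print("silhouette score:", round(silhouette_score(X, labels), 3))
```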

In any case, it is important to specify some kind of error measure. This error measure tells the user how much confidence they can have in the results, whether “for sure this will work” or “for sure it won’t.” If the error measure happens to be zero for all cases, that would indicate that the model is overfit, and it might perform differently in reality.

The sixth and last step is Deployment and it consists of presenting the results in a useful and understandable manner; by achieving this, the project should achieve its goals. It is the only step that does not belong to a cycle.

Depending on the final user, what counts as a useful and understandable manner may vary. For example, if the final user is another piece of software, as when a sales website asks its recommendation system what to suggest to a buyer, a useful manner would be a JSON payload carrying the response to a specific query. In another case, such as a top executive who requires projected information for decision making, the best manner to present the findings is to store them in an analytical database and present them as a dashboard in a business intelligence solution.

I decided to write this short description because I’m surprised by the lasting relevance of the methodology. It has been around for a long time, and it seems likely to prevail for longer.

This methodology is quite logical and straightforward in its steps. Because it evaluates all aspects of a data mining project and allows cycles in its execution, it is robust and earns trust. It is no surprise that most developers and project managers choose it, and that the alternative methodologies are quite similar.

I hope this short introduction helps IT professionals argue for the methodological development of their tasks. People in several other areas of informatics can read this story and get a basic understanding of what data scientists are doing, and how it relates to other profiles such as data engineering and business intelligence.

I hope you have enjoyed it as this is my first story :).

References:

[1] algedroid, Team, Work, Business, Cooperation (2019). URL: https://pixabay.com/photos/team-work-business-cooperation-4503157/

[2] Alberto Cavadia, Juan Gómez, and Israel Rodríguez, Propuestas Empresariales Metodológicas para el Desarrollo de Proyectos de Big Data (2015), data science paper.

[3] Gregory Piatetsky, CRISP-DM, still the top methodology for analytics, data mining, or data science projects (2014). URL: https://www.kdnuggets.com/2014/10/crisp-dm-top-methodology-analytics-data-mining-data-science-projects.html

[4] Kenneth Jensens, Process diagram showing the relationship between the different phases of CRISP-DM (2012). URL: https://es.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining#/media/Archivo:CRISP-DM_Process_Diagram.png



Phases of the Data Mining Process


The Cross-Industry Standard Process for Data Mining ( CRISP-DM ) is the dominant data-mining process framework. It's an open standard; anyone may use it. The following list describes the various phases of the process.

Business understanding: Get a clear understanding of the problem you're out to solve, how it impacts your organization, and your goals for addressing it. Tasks in this phase include:

Identifying your business goals

Assessing your situation

Defining your data mining goals

Producing your project plan

Data understanding: Review the data that you have, document it, identify data management and data quality issues. Tasks for this phase include:

Gathering data

Verifying quality

Data preparation: Get your data ready to use for modeling. Tasks for this phase include:

Selecting data

Cleaning data

Constructing

Integrating

Modeling: Use mathematical techniques to identify patterns within your data. Tasks for this phase include:

Selecting techniques

Designing tests

Building models

Assessing models

Evaluation: Review the patterns you have discovered and assess their potential for business use. Tasks for this phase include:

Evaluating results

Reviewing the process

Determining the next steps

Deployment: Put your discoveries to work in everyday business. Tasks for this phase include:

Planning deployment (your methods for integrating data mining discoveries into use)

Reporting final results

Reviewing final results

About This Article

This article is from the book Data Mining For Dummies.

About the book author:

Meta S. Brown helps organizations use practical data analysis to solve everyday business problems. A hands-on data miner who has tackled projects with up to $900 million at stake, she is a recognized expert in cutting-edge business analytics.


Data Science Process Alliance

What is SEMMA?

The SAS Institute developed SEMMA as its process for data mining. It has five steps (Sample, Explore, Modify, Model, and Assess), earning the acronym SEMMA. You can use the SEMMA data mining methodology to solve a wide range of business problems, including fraud identification, customer retention and turnover, database marketing, customer loyalty, bankruptcy forecasting, market segmentation, as well as risk, affinity, and portfolio analysis.

Businesses use the SEMMA methodology on their data mining and machine learning projects to achieve a competitive advantage, improve performance, and deliver more useful services to customers. The data we collect about our surroundings serve as the foundation for hypotheses and models of the world we live in.

Ultimately, data is accumulated to help in collecting knowledge. That means the data is not worth much until it is studied and analyzed. But hoarding vast volumes of data is not equivalent to gathering valuable knowledge. It is only when data is sorted and evaluated that we learn anything from it.

Thus, SEMMA is designed as a data science methodology to help practitioners convert data into knowledge.

The 5 Stages Of SEMMA

SEMMA is leveraged as an organized, functional toolset, or at least is claimed as such by SAS, and is associated with their SAS Enterprise Miner initiative. While the SEMMA process can seem opaque to those not using that tool, most regard it as a general data mining methodology rather than a specific tool.

The process breaks down into five stages:

Sample: extract a representative subset of the data that is large enough to contain the significant information yet small enough to manipulate quickly.

Explore: examine the data to find trends, anomalies, and relationships.

Modify: create, select, and transform the variables to prepare them for modeling.

Model: apply modeling techniques to search for combinations of variables that reliably predict the outcome.

Assess: evaluate the usefulness and reliability of the model's findings.


How Popular is SEMMA?

In four polls spanning from 2002 to 2014 from KDnuggets.com, respondents selected SEMMA 7–13% of the time. While significantly less than CRISP-DM, this represents the second most commonly selected pre-defined framework.

[Figure: KDnuggets data science methodology poll results]

We conducted a similar poll on this site in 2020. SEMMA was only selected by a single person. This is not a true comparison to KDnuggets’ polls as our audience likely has different demographics and our result options and question were different.

[Figure: poll results for the most popular data science processes]

However, anecdotally, we don’t encounter many practitioners who have even heard of SEMMA. And given its myopic focus (as discussed in the next section), SEMMA likely has fallen out of favor relative to more modern and comprehensive data science methodologies.

SEMMA vs KDD Process vs CRISP-DM

The Cross Industry Standard Process for Data Mining (CRISP-DM) and the Knowledge Discovery in Databases (KDD) Process are two similar data mining life cycles.

In comparing KDD and SEMMA, on a high level the parallels draw themselves. The Sample stage is relatively comparable to KDD’s Selection, and both the Pre-processing and Explore phases achieve the same basic function in their respective processes.

The Modification stage, much like its KDD equivalent, Transformation, is responsible for refining the sorted data from the stage before it, and the Modeling phase is a loose equivalent of Data Mining (as defined by KDD) in the sense that it is when the collected, selected, and refined data is brought together through various tests in order to test the derived knowledge and illustrate it more visually. Finally, the Assess step of SEMMA is a near-direct equivalent of KDD’s Evaluation phase, where the data mining/modeling results are tested for their efficacy, and previously unknown findings are funneled back to refine the cyclical process.

SEMMA is a rather myopic approach toward data science projects. It does its job of explaining the core technical steps of a machine learning life cycle. However, as data science projects enter mainstream organizations, a more comprehensive approach is needed.




The stages of mining: 5 lifecycle processes explained


When looking at mining stocks, it's easy to only focus on the finished product that you are investing your money in. Whether that's uranium, gold, silver, palladium or any other natural resource, it is necessary to understand the full extraction process in order to really appreciate the asset. 

Just like no two diamonds are the same, neither are two mining projects.

Every billion-dollar project varies in some way (location, commodity, size) but there are 5 key stages that all miners follow that form the backbone of mine development.  

The 5 Lifecycle Stages of Mining


1. Exploration & Prospecting Stage 


This is the first and most essential step of the mining process: in order to open a mine, companies must first find an economically sufficient amount of the deposit (an amount of ore or mineral that makes exploitation worthwhile.)

Geologists are enlisted by the companies to understand the characteristics of the land to identify the presence of mineral deposits. 

What is a geologist? 

A geologist studies the solid, liquid, and gaseous matter of the Earth as well as the processes that shape them. A mining geologist is responsible for mapping out the locations of valuable minerals and will use aerial photographs, field maps, and geophysical surveys, to determine where valuable materials are and estimate how much of those materials are in that location.

Exploration geologists search for mineral resources and get involved in the planning and expansion of mining operations. They locate and evaluate potential deposits of precious metals, industrial minerals, gemstones, pigments, construction materials or other minable commodities.


What mining techniques are used by geologists?

Geological surface mapping and sampling

A Geologist will record all geological information from the rocks that outcrop at the surface and will look for boundaries between different rock types and structures, look for fault-lines and evidence of the rocks undergoing deformation. The geologist will look for ore minerals, evidence of metal-rich fluids passing through the rock, and recording mineralised veins and their distribution. 

Mining companies need to target and prioritise their drilling activity so will use this data to target more specific areas where rock and mineral sampling might be appropriate. High-resolution geological mapping can also delineate areas of likely mineralisation which will lead to potential deposits.

Geophysical measurements

Geophysical measurements are taken for mineral exploration to collect information about the physical properties of rocks and sediments. Geophysical companies employ the use of magnetic, radiometric, electromagnetic and gravity surveys to detect responses which may indicate the presence of mineral deposits.

Exploration geophysics is used to detect the type of mineralisation, by measuring its physical properties. It is used to map the subsurface structure of a region, to understand the underlying structures, the spatial distribution of rock units, and to detect structures such as faults, folds and intrusive rocks. 

Geochemical analysis 

A chemical analysis that determines the proportion of metallic or non-metallic presence in a sample is called an assay. A wide variety of geological materials can be chemically analysed which include water, vegetation, soil, sediment and rock. 

Assay labs can provide single and multi-element analyses by a variety of methods. Rock and soil samples are crushed, powdered, fused or digested in acid and then analysed using several different analytical methods and instruments. 

Water, oil and soil tests

Most metallic ore deposits are formed through the interaction of an aqueous fluid and host rocks. Baseline samples are taken to determine hydrologic conditions and natural occurrences of potentially toxic elements in rocks, soils, and waters. 

Surface geochemical analysis examines soil, rock, water, vegetation, and vapour for trace amounts of metals or other elements that may indicate the presence of a buried ore deposit. Geochemical techniques have played a key role in the discovery of numerous mineral deposits, and they continue to be a standard method of exploration.

Rock, water, soil and vegetation samples collected by prospectors and geoscientists can either be tested on-site or in laboratories called assay labs.

Airborne or ground geophysical surveys

Through either ground or airborne methods, geophysical companies undertake magnetic, radiometric and electromagnetic surveys to detect a response which may indicate potential deposits of mineral resources. 

Airborne geophysical surveys are used for mineral exploration for mapping exposed bedrock, geological structures, sub-surface conductors, paleochannels, mineral deposits and salinity. There are several airborne geophysical methods used for minerals exploration including aeromagnetics, radiometrics and VTEM. A digital elevation model (DEM) is also used as an addition to most airborne geophysical surveys. Gravity surveys can also be conducted from the air as well as from the ground.

Ground-based geophysical surveys are implemented once mining companies have identified potential deposits at a regional scale and are performed from the soil surface, through boreholes, excavations or in a combination of placing sources and detectors.

Mineral exploration involves drilling to probe the contents of known ore deposits and potential sites to produce rock chips and samples of the core. 

Drilling is used in areas that have been identified as targets with potential deposits based on geological, geophysical and geochemical surveys which have led to the design of the drilling programme. The aim is to obtain detailed information about rock types, mineral content, rock fabric, and the relationship between the rock layers close to the surface and at depth. 

Samples taken from the orebody are taken to the lab and geologists can analyse the core by chemical assay and conduct petrologic, structural, and mineralogical studies of the rock.


Exploration objectives are to find the ore and the drilling and sampling will provide the information upon which to base estimates of its quantity and grade.

Estimates of ore grade are based on the assays of samples obtained from drill holes into the ore. The accuracy of the estimates will depend on the care taken in procuring the samples and the judgment used in deciding on sample interval required, the accuracy in assaying, and the proper weighting of the individual assays in combining them for determining average grades of individual ore blocks, especially the treatment of erratic high values. 

Valuable minerals are distributed unevenly and are present in varying degrees of purity throughout the material so that assays of individual samples may vary widely throughout sampling.

Socio-economic factors

Companies must also take into account the socio-economic effects that the presence of a new mine could have on the area and surrounding communities. 

Mining activities, including prospecting, exploration, construction, operation, maintenance, expansion, abandonment, decommissioning and repurposing of a mine can impact social and environmental systems in a range of positive and negative ways. Mining companies need to integrate environmental and social impact assessments into mining projects. 

These assessments are the process of determining, analysing and evaluating the potential environmental and social impacts of a mining project, and designing appropriate implementation and management plans for the mining life cycle.

Orebody models

At the end of the exploration stage, miners are able to draw up a preliminary outline of the potential size of the deposits found using 2D or 3D models of the geological ore. An orebody model serves as the geological basis of all resource estimation and starts with a review of existing drill hole and surface or underground sample data as well as maps and plans with current geological interpretation.

2. Discovery Stage


Mine-site Design & Planning

Once the miners are sufficiently confident that there is a financially viable amount of deposit, the project can progress to the planning stage. 

Companies will create multiple plans with different variables (time-span, amount of ore mined) to evaluate which fulfils the most criteria.

Planning criteria & permit considerations:

From exploration to mining of mineral resources, it is vital to ensure that critical safety and operational risks are considered in designing a mine. The mine plan should allow the miners to work in the safest way possible.

The safety and wellbeing of employees, contractors and local communities is a big concern for responsible mining companies and a mine plan will look at any aspect of mine operations that could have a direct impact on the wellbeing of workers, contractors and communities.

Environmental impact 

The mine plan needs to be designed to keep the damage to the environment to a minimum using strategies that can reduce environmental impact. Lower impact mining techniques will reduce interference at the mining site. Mining waste such as tailings, rocks and wastewater can be reused on or off-site. 

Eco-friendly equipment such as electric engines which will result in big carbon savings and longer lasting equipment will cut down on waste over time. 

Many former mine-sites are left unusable by landowners once the mine life has come to an end. Mine companies can employ land rehabilitation techniques such as topsoil replenishment and reforestation schemes to make the land productive again and speed up the land’s natural recovery process. 

Illegal mining is a significant issue for the industry so preventing illegal or unregulated mining operations will help ensure that all mining is bound by the same environmental standards and ensure accountability.

Economical viability

Mine development starts when a deposit is discovered and continues through to the start of construction. The technical feasibility and the economic viability of each project are determined during the phases of mine development, with more detailed engineering data required at each stage.

Corporate social responsibility 

Social responsibility is very important in the world of mining and companies are finding it beneficial to strengthen their corporate social responsibility (CSR) efforts and find ways to give back to the surrounding community. 

Mines often employ a large percentage of the local residents as their workforce and some companies get involved by financing local suppliers and so promoting local trade and growing the local economy. They also fund shared infrastructure in power distribution, roads, and water treatment and distribution.

Other companies become involved in local communities by supporting climate change programmes and environmental stewardship and wildlife projects, contributing to local and regional programmes including sponsorship of educational and sporting events, local medical facilities and the funding of local children’s schemes and arts festivals. 

Companies aim to employ local labour and trades people wherever possible and focus on educational, health and infrastructure improvements that will have the greatest impact on the quality of life.

3. Development Stage


Once the plan has been confirmed, the real work can begin. This is the longest stage of the process so far, and can take anywhere from 10-20 years before the mine is ready for production, depending on the site size. 

Does a mine's size affect the amount of ore produced?

Measuring mine productivity can be difficult given how unique each operation is.

Mines set their production goals but productivity at some mines is restricted by location. Mines are trying to minimize operating expenditure while continuing to increase productivity.

What does construction involve? 

Building roads.

The construction of roads, rail, air-strips or ports to access the mine plus the services such as water, sewage and power is similar to the work required for establishing other types of industries except that this construction could be in remote areas with added logistical challenges.

Mining roads are a critical component of mining infrastructure and the performance of these roads has a direct impact on operational efficiency, costs and safety. A significant proportion of a mine’s cost is associated with material haulage and well-designed and managed roads contribute directly to reductions in cycle times, fuel burn, tyre costs and overall cost per tonne hauled and critically, underpin a safe transport system. 

Processing facilities

Development of the mine itself is different for an open pit to an underground mine and will require different experience and equipment. Porphyry deposits are often large and many of the deposits are near the surface and mined as open pits with large mining equipment; however, at depth some may have suitable characteristics to convert to large underground block caving mines. Vein type deposits are often narrow, can go to depth and are mined by underground methods with smaller equipment.

Once the mineral is extracted from a mine, it is processed, and the processing operation depends on which material is excavated. The crushing and processing facility is constructed based on the testing, flow sheet and design determined in the feasibility study (FS). Processing of the ore starts with understanding the mineralogy and the metallurgical testing for crushing, grinding and recovery of the metals and treatment/management of the tailings.


Environmental management systems

Environmental aspects are included on the FS which has determined the current environmental habitat and the long-term impact of building the mine. The FS will also have determined the quantity and quality of all ore and waste to be mined plus tailings, the potential to generate acid and other deleterious metals plus how to treat these issues while operating and at closure. 

Also included in the FS is the amount and quantity of water that will be used during operation and whether the water will need long-term treatment. Some countries require the FS as the basis for submitting plans for required mining permits. 

An Environmental Management System (EMS) is part of the management system and includes organizational procedures, environmental responsibilities, and processes and will help the mining company comply with environmental regulations, identify technical and economic benefits, and ensure that corporate environmental policies are adopted and followed. 

Mining companies with economical and technological flexibility have implemented comprehensive EMSs at current sites but these require input from governments, international environmental organizations, educational facilities, and the companies themselves. 

Employee housing 

Mine planning includes decisions on workforce accommodation which will affect not only employee quality of life but also the impacts and relationships with existing local communities. Workforce accommodation are usually community-based (either as purpose-built company towns or integrated within existing local communities) or commuter (fly-in, fly-out) mine camps which will depend on the location of the mine and how remote it is. 

The quality of accommodation underpins the fulfilment, morale and motivation of employees. This is not only relevant to productivity and safety, but also to recruitment and retention, particularly with the significant human resources crisis. If communities exist close to a proposed mine then the accommodation strategy can influence the value-adding potential for the sustainable development of such communities. 

Where mine locations are isolated in remote areas and/or face significant economic, social and political adversity, the decisions on employee housing are more challenging. The mine company will need to understand the complexity of local planning issues and consider environmental, social, economic and political implications, together with the proposed accommodation strategy. 

Other facilities

4. Production Stage


Now the mine is finally ready to begin producing. 

What are the two common methods of mining? 

Surface mining

Surface mining is a broad category of mining in which the soil and rock overlying the mineral deposit is removed. It has been estimated that more than two-thirds of the world’s yearly mineral production is extracted by surface mining. 

Surface mining is the preference for mining companies because removing the terrain surface to access the mineral beneath is often more cost-effective than digging tunnels and shafts to access mineral resources underground. 

Surface mining methods: 

Underground mining 

Underground mining is used to access ores and valuable minerals in the ground by digging into the ground to extract them. There are several underground mining techniques used to excavate hard minerals, usually those containing metals such as ore containing gold, silver, iron, copper, zinc, nickel, tin and lead, but also for excavating ores of gems such as diamonds and rubies.

Underground mining methods: 


5. Reclamation Stage


Before the company can be issued a permit to build the mine, they must first prove that they have the funds and plans to close the mine in a safe and structured way.

Mining is a temporary activity, once the deposit is gone it's time to relocate to a new site. But before they can do this, they must first close and rehabilitate the mine. 

What needs to happen before a mine can close?

The final step in mining operations is closure and reclamation. Mine companies have to think about a mine closure plan before they start to build as governments need assurances that operators have a plan and the required funds to close the mine before they are willing to issue permits.

Detailed environmental studies form a big part of the mine closure plan on how the mine site will be closed and rehabilitated. A comprehensive mine rehab programme will also include:

Ensuring public health and safety

There are many dangers with abandoned mines, many of which are not visible from the outside, including horizontal openings, vertical shafts, explosives and toxic chemicals, dangerous gases, deep water, spoils piles, abandoned unsafe buildings and high walls. Mine companies need to ensure mines are fully closed and sealed to make them safe for the public.

Removing waste and hazardous material

There is a high-volume of waste material that originates from the processes of excavation, dressing and further physical and chemical processing of metalliferous and non-metalliferous minerals and mine companies need to remove waste and hazardous material from the site both during operation and at closure of the mine.

Establishing new landforms and vegetation

Reclamation of mined areas involves the re-establishment of viable soils and vegetation at a mine site. For example, a simple approach could add lime or other materials that will neutralize acidity plus a cover of topsoil to promote vegetation growth. Modifying slopes and planting vegetation will stabilise the soil and prevent erosion. 

Minimising environmental effects

A landscape affected by mining can take a long time to rehabilitate, and mine companies need to minimise environmental effects during the mine life and mitigate the impacts of mining from the discovery phase through to closure:

Preserving water quality

The initial closure plan usually focuses on water quality and where the water will go after closure and the quantity of water which will either discharge or migrate into the groundwater system after flooding.

Mine companies must find ways of protecting groundwater and surface water resources and to understand the risks related to water quantity and quality and to develop appropriate engineering controls and reclamation measures.

Stabilising land to protect against erosion

Reduction of slopes by land infill and reclamation, growing plants and trees on mined areas will stabilise the soil and reduce erosion by binding the soil and protecting the ground. Good erosion control will help keep valuable soils on the land and allow natural growth and regeneration. 

Mine closure plans can aim to renovate the site to varying degrees:

1. Remediation

Cleaning up the contaminated area, removing all mine wastes including water and the treatment of water. Isolating contaminated material.

2. Reclamation

Stabilising the terrain, infill, landscaping and topsoil replacement to make the land useful once again.

3. Restoration

Rebuilding any part of the ecosystem that was disturbed as a result of the mine such as flora and fauna. The planting of trees and vegetation native to the area to allow regeneration.

4. Rehabilitation

Rehabilitating the site to a stable and self-rejuvenating state, either as it was before the mine was built or as a new equivalent ecosystem to take local environmental conditions into account. Mines can be repurposed for other uses such as for agriculture, solar panel farms, biofuel production or even recreational and tourist use.

Mine closure process: 

1. Shut-down:

Production stops and workers are reduced. Some skilled workers are retained to permanently shut down the mine. Re-training or early retirement options are sometimes provided.  

2. Decommissioning

The mine is decommissioned by workers or contractors who take apart the mining processing facilities and equipment which is cleaned to be stored or sold. Buildings are repurposed or demolished, warehouse materials are recovered, and waste is disposed of.  

3. Remediation/Reclamation: 

The land and watercourses are reclaimed to a good standard to ensure any landforms and structures are stable, and watercourses are of acceptable water quality. Hazardous materials are removed and land is reshaped and restored by adding topsoil and planting native grasses, trees, or ground cover. 

4. Post-closure

It is important to assess the reclamation programme post closure and to identify any further actions required. Mines may require long-term care and maintenance after mine closure such as ongoing treatment of mine discharge water, periodic monitoring and maintenance of tailings containment structures, and monitoring any ongoing remediation technologies used such as constructed wetlands. 

What happens to a mine once it’s closed? 

Post-mining land use is an important issue in mine lifecycle planning, and there are many extraordinary examples of how mine sites have been repurposed, from underground bike parks to luxury hotels.

Crux Investor 5 stages of mining lifecycle

Now that you understand how a mine works, it's time to decide how you want to invest in mining stocks. Major companies? Junior companies? Gold? Silver? Uranium? The list is endless. This guide is a good starting point: Complete Guide: How to Invest in Mining Stocks (New 2021)

Looking for even more?

That's where we come in. Crux Investor is an investing app for busy people.

You’ll receive a single stock recommendation each month, curated by industry experts and presented in a clear and focused one-page memo. You’ll also receive access to a platform full of programmes that will allow you to grow your financial knowledge at your own pace.

Crux Investor is for anyone interested in saving time while investing with confidence. It's an ideal resource for the novice who needs guidance and is tired of throwing money away on guesses and gambles. But it's also a perfect fit for the experienced investor who wants a faster and more efficient way to arrive at the right stock or significantly increase their knowledge.

Finally, you can afford the analysts the big funds use. No more gambling, no more guesswork. Instead, save time, slay stress, and start investing with confidence by joining Crux Investor today.



Data Mining Project


Theoretical Considerations for Data Mining

Robert Nisbet Ph.D., ... Ken Yale D.D.S., J.D., in Handbook of Statistical Analysis and Data Mining Applications (Second Edition), 2018

General Requirements for Success in a Data Mining Project

The following are general requirements for the success of a data mining project:

Results will identify “low-hanging fruit,” as in a customer acquisition model where analytic techniques haven't been tried before (and anything rational will work better).

Improved results can be highly leveraged; that is, an incremental improvement in a vital process will have a strong bottom-line impact. For instance, reducing “charge-offs” in credit scoring from 10% to 9.8% could make a difference of millions of dollars.

A team skilled in each required activity. For other than very small projects, it is unlikely that one person will be sufficiently skilled in all activities. Even if that is so, one person will not have the time to do it all, including data extraction, data integration, analytic modeling, and report generation and presentation. But, more importantly, the analytic and business people must cooperate closely so that analytic expertise can build on the existing domain and process knowledge.

Data vigilance: Capture and maintain the accumulating information stream (e.g., model results from a series of marketing campaigns).

Time: Learning occurs over multiple cycles. Early models can be improved by performing error analyses, which can point to changes in the data preparation and modeling methodology to improve future models. Also, champion-challenger tests with multiple algorithms can produce models with enhanced predictability. Successive iterations of model enhancement can generate successive increases in success (a minimal champion-challenger sketch follows this list).
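The champion-challenger idea mentioned above can be illustrated with a short, hedged sketch: several candidate algorithms are trained on the same data and the best cross-validated performer is kept as the champion. The scikit-learn estimators and the bundled demonstration dataset are assumptions chosen only to keep the example self-contained; any labelled tabular data would serve.

```python
# Minimal champion-challenger sketch (illustrative only): train several
# candidate algorithms on the same data and keep the best performer.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)   # stand-in for project data

candidates = {
    "logistic_regression": LogisticRegression(max_iter=5000),
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

# Mean 5-fold cross-validated accuracy for each challenger
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}

champion = max(scores, key=scores.get)
print(scores)
print("champion:", champion)
```

In a real project the champion would then be re-challenged in later cycles as new data and error analyses accumulate.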

Each of these types of data mining applications followed a common methodology in principle. We will expand on the subject of the data mining process in Chapter 3.

Business Objectives

David Nettleton, in Commercial Data Mining, 2014

Evaluation of Viability in Terms of Available Data – Specific Considerations

The following list provides specific considerations for evaluating the viability of a data mining project in terms of the available data:

Does the necessary data for the business objectives exist, and does the business have access to it?

If part or all of the data does not exist, can processes be defined to capture or obtain it?

What is the coverage of the data with respect to the business objectives?

What is the availability of a sufficient volume of data over a required period of time, for all clients, product types, sales channels, and so on? (The data should cover all the business factors to be analyzed and modeled. The historical data should cover the current business cycle.)

Is it necessary to evaluate the quality of the available data in terms of reliability? (Reliability depends on the percentage of erroneous, incomplete, or missing data, and the ranges of values must be wide enough to cover all cases of interest; a rough check of this kind is sketched after this list.)

Are people available who are familiar with the relevant data and the operational processes that generate the data?
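Several of these questions, particularly those about reliability and coverage, can be answered with a very small amount of code before any modeling begins. The sketch below assumes a hypothetical customer extract loaded with pandas; the file and column names are invented for illustration.

```python
# Rough data-viability check (illustrative): completeness, time coverage,
# and value ranges. File and column names are hypothetical.
import pandas as pd

df = pd.read_csv("customer_history.csv")          # hypothetical extract

# Reliability: percentage of missing values per column
missing_pct = df.isna().mean().mul(100).round(1)
print(missing_pct.sort_values(ascending=False))

# Coverage: does the history span the business cycle to be analyzed?
df["sale_date"] = pd.to_datetime(df["sale_date"])
print("date range:", df["sale_date"].min(), "to", df["sale_date"].max())

# Ranges: are the values wide enough to cover the cases of interest?
print(df["order_value"].describe())
```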

Deployment Systems

This chapter has given a brief introduction to how the results of data mining projects (analysis and modeling) can be deployed in the business environment. Simple but effective options such as query/reporting and EIS have been discussed, as have more complex options such as expert systems and case-based systems. The option chosen depends on the type of business, how it is run, and the specific data needs for decision-making. If a simple report of sales leads, each with an associated probability of acceptance, does the job, then there is no obligation to install a complex expert system. However, even query/reporting and EIS are not necessarily plug-and-play applications; they usually need customizing and some technical support to get them to do what the user wants.
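As a concrete (and deliberately simple) illustration of the query/reporting option, the sketch below turns already-computed model scores into the kind of ranked sales-lead report described above. The lead names and probabilities are invented; in practice the scores would come from the deployed model.

```python
# Deployment as a simple report (illustrative): rank sales leads by an
# assumed, previously computed probability of acceptance.
import pandas as pd

leads = pd.DataFrame({
    "lead_id": [101, 102, 103, 104],
    "contact": ["A. Perez", "B. Chan", "C. Novak", "D. Osei"],
    "p_acceptance": [0.82, 0.41, 0.67, 0.12],   # scores from a prior model
})

report = leads.sort_values("p_acceptance", ascending=False)
report.to_csv("weekly_lead_report.csv", index=False)   # shared with sales
print(report.to_string(index=False))
```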

Finally, the following is a brief explanation of a recent news story that serves as a cautionary tale about the preparation of statistical summaries and their usage. In April 2013, world economists were surprised and perplexed by the news that two prestigious Harvard academics had made a basic error when using an Excel spreadsheet to summarize their findings and publish an influential economic model. This model has been used in recent years (circa 2013) by economists around the world to support the economic arguments that countries should cut spending to promote economic growth.

The elemental error was in the definition of an average-function cell based on the range of values in a given column of data. Each row showed the GDP growth for a given country when the country’s debt-to-GDP ratio was 90 percent or more. The conclusion from the data (which included the error) was that, were a country’s debt-to-GDP ratio to exceed 90 percent (the critical threshold), economic growth would drop off sharply. However, the range defined in the average function omitted the last five cells (which held the data for Denmark, Canada, Belgium, Austria, and Australia). If these countries had been included, the average GDP growth would have been 2.2 percent instead of −0.1 percent. The evaluations of the findings and of this error are still being debated.

Apart from the error itself, another criticism is the lack of control over how such an error can pass unchecked and become common wisdom. However, aside from the implications of this Excel error for world economic policy, in the context of presenting key business information derived from data mining (which is the theme of this chapter), the lesson is that a report and the data it is based on should be double-checked, not by the same person, but by a peer, colleague, manager, or subordinate who is able to independently debug any possible faulty calculations or fundamental assumptions.
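The mechanics of the error are easy to reproduce. The numbers below are invented, not the actual study data; they simply show how omitting the last few rows from an averaging range can change the result dramatically, even flipping its sign.

```python
# Toy illustration with invented numbers: an averaging range that omits
# the last five values can flip the sign of the result.
growth_all = [-7.9, -2.0, 1.0, 2.4, 2.5, 3.0, 3.2, 3.5, 3.6, 4.1]

mean_all = sum(growth_all) / len(growth_all)                   # full column
mean_truncated = sum(growth_all[:-5]) / len(growth_all[:-5])   # range too short

print(f"full range:        {mean_all:.1f}")        # positive growth
print(f"last five omitted: {mean_truncated:.1f}")  # negative growth
```

A peer reviewing the spreadsheet, or a second, independent implementation of the same summary, would catch exactly this kind of range error.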

Accessory Tools for Doing Data Mining

Before moving into a discussion of the proper algorithms to use for a data mining project, we must take a side trip to help you understand that modeling algorithms are just one set of data mining tools you will use to complete a data mining project. The practice of data mining includes a number of techniques that have been developed to serve as a set of tools in the data miner's toolbox. In the early days of data mining, many of these tools had to be built (usually in SQL or Perl) and used in an ad hoc fashion for every job. Many of these functions have since been included as separate objects in data mining packages or “productized” separately. Most jobs will require the data miner to become proficient even in those tools that are not included in a given data mining package. The following tools can help the data miner:

Data access tools: SQL and other database query languages

Data integration tools: extract-transform-load (ETL) tools to access, modify, and load data from different structures and formats into a common output format (e.g., database and flat file)

Data exploration tools: basic descriptive statistics, particularly frequency tables; slicing, dicing, and drill-downs

Model management tools: data mining workspace libraries, templates, and projects

Modeling analysis tools: feature selection; model evaluation tools. (Note: This topic will be expanded in Chapter 11.)

Miscellaneous tools: in-place data processing (IDP) tools, rapid deployment tools, and model monitoring tools

Being able to use these tools properly can be very helpful in identifying significant variables and in facilitating the rapid decision-making necessary to compete successfully in the global marketplace.
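As one hedged example of the exploration tools in the list above, the sketch below builds a simple frequency table and a drill-down, first through SQL and then through pandas. The database file, table, and column names are assumptions made purely for illustration.

```python
# Data exploration sketch (illustrative): frequency tables and a simple
# drill-down. Database, table, and column names are hypothetical.
import sqlite3
import pandas as pd

conn = sqlite3.connect("mining_project.db")      # hypothetical store

# SQL flavour: record counts per product category
freq_sql = pd.read_sql_query(
    "SELECT product_category, COUNT(*) AS n "
    "FROM transactions GROUP BY product_category ORDER BY n DESC",
    conn,
)
print(freq_sql)

# pandas flavour: the same frequency table, plus a drill-down by region
df = pd.read_sql_query("SELECT * FROM transactions", conn)
print(df["product_category"].value_counts())
print(pd.crosstab(df["region"], df["product_category"]))
```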

Tim Menzies, ... Burak Turhan, in Sharing Data and Models in Software Engineering, 2015

6.2.1 Scouting

In the scout phase, rapid prototyping is used to try many mining methods on the data. In this phase, experimental rigor is less important than exploring the range of user hypotheses. The other goal of this phase is to gain the interest of the users in the induction results.

It is important to stress that feedback to the users can and must appear very early in a data mining project. In our experience, users find it very hard to express what they want from their data. This is especially true if they have never mined it before. However, once we start showing them results, their requirements rapidly mature, as initial results help them sharpen the focus of the inductive study. Therefore, we recommend:

Simplicity first. Prior to conducting very elaborate studies, try applying very simple tools to gain rapid early feedback.

For example, simple linear-time column pruners, such as those discussed in the last chapter, comment on what factors are not influential in a particular domain. It can be insightful to discuss this information with the users.
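The column pruners discussed in that chapter are not reproduced here, but a crude stand-in conveys the idea: flag columns whose values barely vary, since such factors cannot be influential. The sketch below uses scikit-learn's VarianceThreshold on a toy table and is only one simple possibility, not the specific pruners from the book.

```python
# A crude linear-time "column pruner" (illustrative): drop columns whose
# values do not vary, i.e. factors that cannot be influential.
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy table: the third column is constant and carries no signal.
X = np.array([
    [1.0, 20.0, 5.0],
    [2.0, 35.0, 5.0],
    [3.0, 18.0, 5.0],
    [4.0, 40.0, 5.0],
])

selector = VarianceThreshold(threshold=0.0)   # keep variance > 0 only
selector.fit(X)

print("kept columns:   ", np.flatnonzero(selector.get_support()))
print("dropped columns:", np.flatnonzero(~selector.get_support()))
```

Reporting such uninfluential columns back to users early is exactly the kind of rapid, simple feedback the scouting phase calls for.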

Introduction

1.4 How to read this book

This book covers the following material:

Part I: Data Mining for Managers: The success of an industrial data mining project depends on technical matters as well as some very important organizational matters. This first part describes those organizational issues.

Part II: Data Mining: A Technical Tutorial: Discusses data mining for software engineering (SE) applications and several data mining methods that form the building blocks of advanced data science approaches for software engineering. In this book, we apply those methods to numerous applications of data mining for SE, including software effort estimation and defect prediction.

Part III: Sharing Data: In this part, we discuss methods for moving data across organizational boundaries. The topics covered here include how to find learning contexts and then how to learn across those contexts (for cross-company learning); how to handle missing data; active learning; and privacy issues.

Part IV: Sharing Models: In this part, we discuss how to take models learned from one project and adapt and apply them to others. Topics covered here include ensemble learning, temporal learning, and multiobjective optimization.

The chapters of Parts I and II document a flow of ideas while the chapters of Parts III and IV were written to be mostly self-contained. Hence, for the reader who likes skimming, we would suggest reading all of Parts I and II (which are quite short) then dipping into any of the chapters in Parts III and IV, according to your own interests.

To assist in finding the parts of the book that most interest you, this book contains several roadmaps:

See Chapter 2 for a roadmap to Part I: Data Mining for Managers.

See the start of Chapter 7 for a roadmap to Part II: Data Mining: A Technical Tutorial.

See Chapter 11, Section 11.2, for a roadmap to Part III: Sharing Data.

See Chapter 19 for a roadmap to Part IV: Sharing Models.

4.1 Data analysis patterns

As another guide to readers, from Chapter 12 onwards each chapter starts with a short summary table that we call a data analysis pattern:

Seven principles of inductive software engineering

T. Menzies, in Perspectives on Data Science for Software Engineering, 2016

Principle #4: Be Open Minded

The goal of inductive engineering for SE is to find better ideas than what was available when you started. So if you leave a data mining project with the same beliefs as when you started, you really wasted a lot of time and effort. Hence, some mantras to chant while data mining are:

Avoid a fixed hypothesis. Be respectful but doubtful of all human-suggested domain hypotheses. Certainly, explore the issues that they raise, but also take the time to look further afield.

Avoid a fixed approach to data mining (e.g., just using decision trees all the time), particularly for data that has not been mined before.

The most important initial results are the ones that radically and dramatically improve the goals of the project. So seek important results.

Incorporating Various Sources of Data and Information

This chapter discusses data sources that can be accessed for a commercial data analysis project. One way of enriching the information available about a business’s environment and activity is to fuse together various sources of information and data. The chapter begins with a discussion of internal data; that is, data about a business’s products, services, and customers, together with feedback on business activities from surveys, questionnaires, and loyalty and customer cards. The chapter then considers external data—which affects a business and its customers in various ambits—such as demographic and census data, macro-economic data, data about competitors, and data relating to stocks, shares, and investments. Examples are given for each source and where and how the data could be obtained.

Although some readers may be familiar with one or more of these data sources, they may need help selecting which to use for a given data mining project. Table 3.1 gives examples of which data sources are relevant for which business objectives and commercial data mining activities. Columns two through eight show the seven data sources described in this chapter, and the column labeled “Business Objectives” lists generic business examples. Each cell indicates whether a specific data source would be required for the given business objective.

Table 3.1. Business objectives versus data sources

Data Sources

Primary Data Sources

The primary data sources include the data already in the basic data repository derived from a business’s products, services, customers, and transactions. That is, a data mining project could be considered that uses only this elemental data and no other sources. The primary data sources are indicated in the columns labeled “Internal” in Table 3.1.

Each data mining project must evaluate and reach a consensus on which factors, and therefore which sources, are necessary for the business objective. For example, if the general business objective is to reduce customer attrition and churn (loss of clients to the competition), a factor related to customer satisfaction may be needed that is not currently in the database. Hence, in order to obtain this data, the business might design a questionnaire and launch a survey of its current customers. Defining the necessary data for a data mining project is a recurrent theme throughout Chapters 2 to 9, and the results of defining and identifying new factors may require a search for the corresponding data sources, if available, and/or obtaining the data via surveys, questionnaires, new data capture processes, and so on. Demographic data about specific customers can be elicited from them by using the surveys, questionnaires, and loyalty registration forms discussed in this chapter.

With reference to demographic data, we distinguish between the general anonymous type (such as that of the census) and specific data about identifiable customers (such as age, gender, marital status, and so on).

Methodologies for Knowledge Discovery Processes in Context of AstroGeoInformatics

Peter Butka PhD, ... Juliana Ivančáková MSc, in Knowledge Discovery in Big Data from Astronomy and Earth Observation, 2020

1.3.3 Proprietary Methodologies – Usage of Specific Tools

While the research and open-standard methodologies are more general and tool-free, some of the leaders in the area of data analysis also provide their customers with proprietary solutions, usually based on their own software tools.

One such example is the SEMMA methodology from the SAS Institute, which provides a process description for working with its data mining tools. SEMMA is a list of steps that guides users through the implementation of a data mining project. While SEMMA still provides quite a general overview of the knowledge discovery process, its authors claim it is the most logical organization of their tools (known as SAS Enterprise Miner) to cover the core data mining tasks. The main difference between SEMMA and the traditional KDD overview is that the first step of application domain understanding (business understanding in CRISP-DM) is skipped. SEMMA also does not include the knowledge application step, so the business aspect is out of scope for this methodology ( Azevedo and Santos, 2008 ). Both of these steps are considered crucial for project success in the knowledge discovery community. Moreover, applying this methodology outside the SAS software tools is not easy. The phases of SEMMA and their related tasks are the following (a minimal illustration of the sequence appears after the list):

Sample – the first step is data sampling – a selection of the dataset and data partitioning for modeling; the dataset should be large enough to contain representative information and content, but still small enough to be processed efficiently.

Explore – understanding the data, performing exploration analysis, examining relations between the variables, and checking anomalies, all using simple statistics and mostly visualizations.

Modify – methods to select, create, and transform variables (attributes) in preparation for data modeling.

Model – the application of data mining techniques on the prepared variables, the creation of models with (possibly) the desired outcome.

Assess – the evaluation of the modeling results, and analysis of reliability and usefulness of the created models.
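The sketch below walks through the five SEMMA steps in order, using scikit-learn on a bundled demonstration dataset. It illustrates only the sequence of tasks; SEMMA itself is tied to SAS Enterprise Miner, so the tools and dataset here are stand-in assumptions.

```python
# A minimal, tool-agnostic walk through the five SEMMA phases
# (Sample, Explore, Modify, Model, Assess). Illustrative only.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Sample: select the dataset and partition it for modelling.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Explore: simple statistics to understand the variables.
print("training rows:", X_train.shape[0], "variables:", X_train.shape[1])
print("positive-class proportion:", round(y_train.mean(), 2))

# Modify: select, create, and transform variables for modelling.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# Model: apply a mining technique to the prepared variables.
model = LogisticRegression(max_iter=5000).fit(X_train_s, y_train)

# Assess: evaluate the reliability and usefulness of the model.
print("hold-out accuracy:",
      round(accuracy_score(y_test, model.predict(X_test_s)), 3))
```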

IBM Analytics Services have designed a newer methodology for data mining/predictive analytics named Analytics Solutions Unified Method for Data Mining/Predictive Analytics (also known as ASUM-DM), 3 which is a refined and extended CRISP-DM. While the strong points of CRISP-DM lie in its analytical part, because of its open-standard nature CRISP-DM does not cover the infrastructure or operations side of implementing data mining projects; that is, it has only a few project management activities and no templates or guidelines for such tasks.

The primary goal of ASUM-DM creation was to solve the disadvantages mentioned above. It means that this methodology retained CRISP-DM and augmented some of the substeps with missing activities, tasks, guidelines, and templates. Therefore, ASUM-DM is an extension or refinement of CRISP-DM, mainly in the more detailed formalization of steps and application of (IBM-based) analytics tools. ASUM-DM is available in two versions – an internal IBM version and an external version. The internal version is a full-scale version with attached assets, and the external version is a scaled-down version without attached assets. Some of these ASUM-DM assets or a modified version are available through a service engagement with IBM Analytics Services. Like SEMMA, it is a proprietary-based methodology, but more detailed and with a broad scope of covered steps within the analytical project.

Finally, we note that KDPs can easily be extended using agile methods originally developed for software development. Agile aspects are most applicable to larger industrial teams, and many approaches are adapted explicitly for a particular company and are therefore proprietary. Generally, a KDP is iterative, so the inclusion of more agile aspects is quite natural ( Nascimento and de Oliveira, 2012 ). The AgileKDD method follows the OpenUP lifecycle, which implements the Agile Manifesto: the project consists of sprints with fixed deadlines (usually a few weeks), and each sprint must deliver incremental value. Another example of an agile process description is ASUM-DM from IBM, which combines project management and agility principles.

Process Models for Data Mining and Analysis

Colleen McCue, in Data Mining and Predictive Analysis, 2007

4.2 CRISP-DM

What the CIA model brings in terms of specificity to intelligence, and by extension to applied public safety and security analysis, the CRISP-DM process model contributes to data mining as a process, which is reflected in its origins. Several years ago, representatives from a diverse array of industries gathered to define the best practices, or standard process, for data mining. 8 The result of this effort was the CRoss-Industry Standard Process for Data Mining (CRISP-DM). The CRISP-DM process model was based on direct experience from data mining practitioners, rather than scientists or academics, and represents a “best practices” model for data mining that was intended to transcend professional domains. Data mining is as much an analytical process as it is a set of specific algorithms and models. Like the CIA Intelligence Process, the CRISP-DM process model has been broken down into six steps: business understanding, data understanding, data preparation, modeling, evaluation, and deployment. 9

Business Understanding

Perhaps the most important phase of the data mining process is gaining an understanding of the current practices and overall objectives of the project. During the business understanding phase of the CRISP-DM process, the analyst determines the objectives of the data mining project. Included in this phase are an identification of the resources available and any associated constraints, the overall goals, and the specific metrics that can be used to evaluate the success or failure of the project.

Data Understanding

The second phase of the CRISP-DM analytical process is the data understanding step. During this phase, the data are collected and the analyst begins to explore and gain familiarity with the data, including form, content, and structure. Knowledge and understanding of the numeric features and properties of the data (e.g., categorical versus continuous data) will be important during the data preparation process and essential to the selection of appropriate statistical tools and algorithms used during the modeling phase. Finally, it is through this preliminary exploration that the analyst acquires an understanding of and familiarity with the data that will be used in subsequent steps to guide the analytical process, including any modeling, evaluate the results, and prepare the output and reports.

Data Preparation

After the data have been examined and characterized in a preliminary fashion during the data understanding stage, the data are then prepared for subsequent mining and analysis. This data preparation includes any cleaning and recoding as well as the selection of any necessary training and test samples. It is also during this stage that any necessary merging or aggregating of data sets or elements is done. The goal of this step is the creation of the data set that will be used in the subsequent modeling phase of the process.
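A compressed, hedged sketch of this preparation step is shown below: cleaning, recoding, merging two extracts, and drawing training and test samples. The incident and location files and their columns are invented placeholders, not data from the text.

```python
# Data preparation sketch (illustrative): cleaning, recoding, merging,
# and a training/test split. File and column names are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split

incidents = pd.read_csv("incidents.csv")      # hypothetical extract
locations = pd.read_csv("locations.csv")      # hypothetical extract

# Cleaning and recoding
incidents = incidents.dropna(subset=["incident_type"])
incidents["night_time"] = (incidents["hour"] >= 20) | (incidents["hour"] < 6)

# Merging data sets on a shared key
data = incidents.merge(locations, on="location_id", how="left")

# Training and test samples for the modeling phase
train, test = train_test_split(data, test_size=0.25, random_state=0)
print(len(train), "training rows,", len(test), "test rows")
```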

During the modeling phase of the project, specific modeling algorithms are selected and run on the data. Selection of the specific algorithms employed in the data mining process is based on the nature of the question and the outputs desired. For example, scoring algorithms or decision tree models are used to create decision rules, based on known categories or relationships, that can be applied to unknown data. Unsupervised learning or clustering techniques are used to uncover natural patterns or relationships in the data when group membership or category has not been identified previously. These algorithms can be categorized into two general groups: rule induction models or decision trees, and unsupervised learning or clustering techniques. Additional considerations in model selection and creation include the ability to balance accuracy and comprehensibility. Some extremely powerful models, although very accurate, can be very difficult to interpret and thus validate. On the other hand, models that generate output that can be understood and validated frequently compromise overall accuracy in order to achieve this.
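The contrast between the two families can be made concrete with a short sketch: a shallow decision tree induces readable rules from known categories, while k-means groups the same records without using the labels at all. The dataset and parameters are assumptions chosen for brevity, not choices made in the text.

```python
# Supervised versus unsupervised modelling (illustrative).
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)   # stand-in for any labelled table

# Supervised: decision rules induced from known categories, which can
# later be applied to unknown (unlabelled) records.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree))            # readable rules help validation

# Unsupervised: natural groupings discovered without using the labels.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(clusters[:10])
```

The readable rule listing also illustrates the accuracy-versus-comprehensibility trade-off noted above: a two-level tree is easy to validate but is rarely the most accurate model available.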

During the evaluation phase of the project, the models created are reviewed to determine their accuracy as well as their ability to meet the goals and objectives of the project identified in the business understanding phase. Put simply: Is the model accurate, and does it answer the question posed?

Finally, the deployment phase includes the dissemination of the information. The form of the information can include tables and reports as well as the creation of rule sets or scoring algorithms that can be applied directly to other data.

This model has worked very well for many business applications; 10 however, law enforcement, security, and intelligence analysis can differ in several meaningful ways. Analysts in these settings frequently encounter unique challenges associated with the data, including timely availability, reliability, and validity. Moreover, the output needs to be comprehensible and easily understood by nontechnical end users while being directly actionable in the applied setting in almost all cases. Finally, unlike in the business community, the cost of errors in the applied public safety setting frequently is life itself. Errors in judgment based on faulty analysis or interpretation of the results can put citizens as well as operational personnel at risk for serious injury or death.

Table 4-1. Comparison of the CRISP-DM and CIA Intelligence Process Models.

The CIA Intelligence Process has unique features associated with its use in support of the intelligence community, including its ability to guide sound policy and information-based operational support. The importance of domain expertise is underscored in the intelligence community by the existence of specific agencies responsible for the collection, processing, and analysis of specific types of intelligence data. The CRISP-DM process model highlights the need for subject matter experts and domain expertise, but emphasizes a common analytical strategy that has been designed to transcend professional boundaries and that is relatively independent of content area or domain. The CIA Intelligence Process and CRISP-DM models are well suited to their respective professional domains; however, they are both somewhat limited in directly addressing the unique challenges and needs related to the direct application of data mining and predictive analytics in the public safety and security arena. Therefore, an integrated process model specific to public safety and security data mining and predictive analytics is outlined below. Like the CIA model, this model recognizes not only a role but also a critical need for analytical tradecraft in the process; and like the CRISP-DM process model, it emphasizes the fact that effective use of data mining and predictive analytics truly is an analytical process that encompasses far more than the mathematical algorithms and statistical techniques used in the modeling phase.
