8 Steps in the Data Life Cycle
- 02 Feb 2021
Whether you manage data initiatives, work with data professionals, or are employed by an organization that regularly conducts data projects, a firm understanding of what the average data project looks like can prove highly beneficial to your career. This knowledge, paired with other data skills, is what many organizations look for when hiring.
No two data projects are identical; each brings its own challenges, opportunities, and potential solutions that impact its trajectory. Nearly all data projects, however, follow the same basic life cycle from start to finish. This life cycle can be split into eight common stages, steps, or phases:

1. Data Generation
2. Data Collection
3. Data Processing
4. Data Storage
5. Data Management
6. Data Analysis
7. Data Visualization
8. Data Interpretation
Below is a walkthrough of the processes typically involved in each stage.
Data Life Cycle Stages
The data life cycle is often described as a cycle because the lessons learned and insights gleaned from one data project typically inform the next. In this way, the final step of the process feeds back into the first.
1. Data Generation

For the data life cycle to begin, data must first be generated. Otherwise, the following steps can’t be initiated.
Data generation occurs regardless of whether you’re aware of it, especially in our increasingly online world. Some of this data is generated by your organization, some by your customers, and some by third parties you may or may not be aware of. Every sale, purchase, hire, communication, and interaction generates data. Given the proper attention, this data can often lead to powerful insights that allow you to better serve your customers and become more effective in your role.
2. Data Collection

Not all of the data that’s generated every day is collected or used. It’s up to your data team to identify what information should be captured and the best means for doing so, and what data is unnecessary or irrelevant to the project at hand.
You can collect data in a variety of ways, including:
- Forms: Web forms, client or customer intake forms, vendor forms, and human resources applications are some of the most common ways businesses generate data.
- Surveys: Surveys can be an effective way to gather vast amounts of information from a large number of respondents.
- Interviews: Interviews and focus groups conducted with customers, users, or job applicants offer opportunities to gather qualitative and subjective data that may be difficult to capture through other means.
- Direct Observation: Observing how a customer interacts with your website, application, or product can be an effective way to gather data that may not be offered through the methods above.
It’s important to note that many organizations take a broad approach to data collection, capturing as much data as possible from each interaction and storing it for potential use. While drawing from this supply is certainly an option, it’s always important to start by creating a plan to capture the data you know is critical to your project.
3. Data Processing

Once data has been collected, it must be processed. Data processing can refer to various activities, including:
- Data wrangling, in which a data set is cleaned and transformed from its raw form into something more accessible and usable. This is also known as data cleaning, data munging, or data remediation.
- Data compression, in which data is transformed into a format that can be more efficiently stored.
- Data encryption, in which data is translated into another form of code to protect its privacy and security.
Even the simple act of taking a printed form and digitizing it can be considered a form of data processing.
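As a small illustration of data wrangling, the sketch below cleans a pair of hypothetical form records: trimming whitespace, normalising casing, and converting an inconsistent date format. All field names and values are invented for the example.

```python
import re

# Hypothetical raw records as they might arrive from a digitized form:
# inconsistent casing, stray whitespace, and mixed date formats.
raw_records = [
    {"name": "  Alice Smith ", "signup": "2021-02-02", "email": "ALICE@EXAMPLE.COM"},
    {"name": "bob jones",      "signup": "02/02/2021", "email": "bob@example.com "},
]

def wrangle(record):
    """Clean one record into a consistent, usable shape."""
    name = " ".join(record["name"].split()).title()      # trim and normalise casing
    email = record["email"].strip().lower()              # canonical email form
    date = record["signup"]
    m = re.fullmatch(r"(\d{2})/(\d{2})/(\d{4})", date)   # convert DD/MM/YYYY
    if m:
        date = f"{m.group(3)}-{m.group(2)}-{m.group(1)}" # to ISO 8601
    return {"name": name, "signup": date, "email": email}

clean = [wrangle(r) for r in raw_records]
print(clean[1])  # {'name': 'Bob Jones', 'signup': '2021-02-02', 'email': 'bob@example.com'}
```

Real projects would use a library such as pandas for this, but the shape of the work is the same: raw, messy input in; consistent, analysis-ready records out.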
4. Data Storage

After data has been collected and processed, it must be stored for future use. This is most commonly achieved through the creation of databases or datasets. These datasets may then be stored in the cloud, on servers, or using another form of physical storage like a hard drive, CD, cassette, or floppy disk.
When determining how to best store data for your organization, it’s important to build in a certain level of redundancy to ensure that a copy of your data will be protected and accessible, even if the original source becomes corrupted or compromised.
5. Data Management

Data management, also called database management, involves organizing, storing, and retrieving data as necessary over the life of a data project. While referred to here as a “step,” it’s an ongoing process that takes place from the beginning through the end of a project. Data management includes everything from storage and encryption to implementing access logs and changelogs that track who has accessed data and what changes they may have made.
6. Data Analysis

Data analysis refers to processes that attempt to glean meaningful insights from raw data. Analysts and data scientists use different tools and strategies to conduct these analyses. Some of the more commonly used methods include statistical modeling, algorithms, artificial intelligence, data mining, and machine learning.
Exactly who performs an analysis depends on the specific challenge being addressed, as well as the size of your organization’s data team. Business analysts, data analysts, and data scientists can all play a role.
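As a minimal sketch of one of the analysis methods named above, statistical modeling, the example below fits an ordinary least-squares line to hypothetical ad-spend vs. sales figures (all numbers are invented for illustration).

```python
# Hypothetical data: monthly ad spend (thousands of dollars) vs. units sold.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales    = [2.1, 4.0, 6.2, 7.9, 10.1]

n = len(ad_spend)
mean_x = sum(ad_spend) / n
mean_y = sum(sales) / n

# Ordinary least squares: slope = cov(x, y) / var(x)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(ad_spend, sales)) \
        / sum((x - mean_x) ** 2 for x in ad_spend)
intercept = mean_y - slope * mean_x

# Use the fitted line to project sales at a new spend level.
predicted = intercept + slope * 6.0
print(round(slope, 2), round(predicted, 2))  # 1.99 12.03
```

In practice an analyst would reach for a statistics library, but the underlying idea, turning raw paired observations into a quantitative relationship you can act on, is exactly what this stage is about.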
7. Data Visualization

Data visualization refers to the process of creating graphical representations of your information, typically through the use of one or more visualization tools. Visualizing data makes it easier to quickly communicate your analysis to a wider audience both inside and outside your organization. The form your visualization takes depends on the data you’re working with, as well as the story you want to communicate.
While technically not a required step for all data projects, data visualization has become an increasingly important part of the data life cycle.
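To show the idea in miniature, and without assuming any particular charting tool, here is a sketch that renders hypothetical monthly sales totals as a text bar chart; real projects would use a dedicated visualization library instead.

```python
# Hypothetical monthly sales totals to visualise.
monthly_sales = {"Jan": 120, "Feb": 150, "Mar": 90, "Apr": 180}

def ascii_bar_chart(data, width=30):
    """Render a simple horizontal bar chart as text, scaled to the widest value."""
    peak = max(data.values())
    lines = []
    for label, value in data.items():
        bar = "#" * round(width * value / peak)
        lines.append(f"{label:>3} | {bar} {value}")
    return "\n".join(lines)

print(ascii_bar_chart(monthly_sales))
```

Even this crude rendering makes the April peak and March dip visible at a glance, which is the point of the visualization stage: the same numbers, communicated faster.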
8. Data Interpretation

Finally, the interpretation phase of the data life cycle provides the opportunity to make sense of your analysis and visualization. Beyond simply presenting the data, this is when you investigate it through the lens of your expertise and understanding. Your interpretation may not only include a description or explanation of what the data shows but, more importantly, what the implications may be.
The eight steps outlined above offer an effective framework for thinking about a data project’s life cycle. That being said, it isn’t the only way to think about data. Another commonly cited framework breaks the data life cycle into a slightly different set of phases. While that framework’s phases use slightly different terms, they largely align with the steps outlined in this article.
The Importance of Understanding the Data Life Cycle
Even if you don’t directly work with your organization’s data team or projects, understanding the data life cycle can empower you to communicate more effectively with those who do. It can also provide insights that allow you to conceive of potential projects or initiatives.
The good news is that, unless you intend to transition into or start a career as a data analyst or data scientist, it’s highly unlikely you’ll need a degree in the field. Several faster and more affordable options for learning basic data skills exist, such as online courses.
The Data Science Project Life Cycle Explained
- April 21, 2021 at 11:00 am
As Covid-19 continues to shape the global economy, analytics and business intelligence (BI) projects can help organisations prepare and implement strategies to navigate the crisis. According to the Covid-19 Impact Survey by Dresner Advisory Services, most respondents believe that data-driven decision-making is crucial to survive and thrive during the pandemic and beyond. This article provides a step-by-step overview of the typical data science project life cycle, including some best practices and expert advice.
Results of a survey by O’Reilly show that enterprises are stabilising their adoption patterns for artificial intelligence (AI) across a wide variety of functional areas.
Source: AI adoption in the enterprise 2020
The same survey shows that 53% of enterprises using AI today recognise unexpected outcomes and predictions as the greatest risk when building and deploying machine learning (ML) models .
As an executive driving and overseeing data science adoption in your organisation, what can you do to achieve a reliable outcome from your data modelling project while getting the best ROI and mitigating security risks at the same time?
The answer lies in thorough project planning and expert execution at every stage of the data science project life cycle. Whether you use your in-house resources or outsource your project to an external team of data scientists, you should:
- Define a business need or a problem that can be solved by data modelling
- Have an understanding of the scope of work that lies ahead
Here’s our rundown of a data science project life cycle, including the six main steps of the cross-industry standard process for data mining (CRISP-DM) and the additional steps that are essential parts of every data science project. This roadmap is based on decades of experience in delivering data modelling and analysis solutions for a range of business domains, including e-commerce, retail, fashion and finance. It will help you avoid critical mistakes from the start and ensure smooth rollout and model deployment down the line.
A typical data science project life cycle step by step
1. Ideation and initial planning
Without a valid idea and a comprehensive plan in place, it is difficult to align your model with your business needs and project goals, or to judge its strengths, scope and the challenges involved. First, you need to understand what business problems and requirements you have and how they can be addressed with a data science solution.
At this stage, we often recommend that businesses run a feasibility study: exhaustive research that allows you to define your goals for a solution and then build the team best equipped to deliver it. There are usually several other software development life cycle (SDLC) steps that will run in parallel with data modelling, including solution design, software development, testing, DevOps activities and more. The planning stage ensures you have all the required roles and skills on your team to make the project run smoothly through all of its stages, meet its purpose and achieve the desired progress within the given time limit.
2. Side SDLC activities: design, software development and testing
As you kick off your data analysis and modelling project, several other activities usually run in parallel as parts of the SDLC. These include product design, software development, quality assurance activities and more. Here, team collaboration and alignment are key to project success.
For your model to be deployed as a ready-to-use solution, you need to make sure that your team is aligned through all the software development stages. It’s essential for your data scientists to work closely with other development team members, especially product designers and DevOps engineers, to ensure your solution has an easy-to-use interface and that all of the features and functionality your data model provides are integrated in the way that’s most convenient to the user. Your DevOps engineers will also play an important role in deciding how the model will be integrated within your real production environment; for example, it can be deployed as a microservice, which facilitates scaling, versioning and security.
When the product is subject to quality assurance activities, the model gets tested within the team’s staging environment and by the customer.
3. Business understanding: Identifying your problems and business needs, strategy and roadmap creation
The importance of understanding your business needs, and the availability and nature of your data, can’t be overstated. Every data science project should be business-first, hence the need to define business problems and objectives from the outset.
In the initial phase of a data science project, companies should also set the key performance indicators and criteria that will be indicative of project success. After defining your business objectives, you should assess the data you have at your disposal, what industry or market data is available, and how usable it is.
- Situational analysis. Experienced data scientists should be able to assess your current operational performance, then define any challenges, bottlenecks, priorities and opportunities.
- Defining your ultimate goals. Undertake a rigorous analysis of how your business goals match the modelling approach, and understand where the gaps in performance and technology are to define the next steps.
- Building your data modelling strategy. When defining your strategy, two aspects are essential: the assets available to you and how well the potential strategy answers your business goals. With these established, you can build business cases to kick-start the process.
- Creating a roadmap. Once you have a strategy in place, you need to design a roadmap that encompasses the programs that will help you reach your goals, the key objectives within each program and all necessary project milestones.
The most important task within the business understanding stage is to define whether the problem can be solved by the available or state-of-the-art modelling and analysis approaches. The second most important task is to understand the domain, which allows data scientists to define new model features, initiate model transformations and come up with improvement recommendations.
4. Data understanding: data acquisition and exploratory data analysis
The preceding stages were intended to help you define your criteria for data science project success. With those in hand, your data science team will be able to prepare your data for analysis and recommend which data to use and how.
The better the data you use, the better your model will be. An initial analysis of the data should provide guiding insights that set the tone for modelling and further analysis. Based on your business needs, your data scientists should understand how much data you need to build and train the model.
How can you tell good data from bad data? Data quality is imperative, but how are you to know if your information really isn’t up to the required standard? Here are some of the red flags to watch out for:
- The data has missing variables and cannot be normalised to a common basis.
- The data has been collected from many very different sources. Information from third parties may come under this banner.
- The data is not relevant to the subject of the algorithm. It might be useful, but not in this instance.
- The data contains contradicting values. This could mean the same values appearing for opposing classes, or very broad variation within one class.

If your data raises any one of these red flags, there’s a chance it will need to be cleaned before you implement an ML algorithm.
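Two of these red flags, missing variables and contradicting values, are easy to check for automatically. The sketch below runs both checks on a tiny hypothetical labelled dataset; the feature values and class labels are invented for illustration.

```python
# Each row: (feature_1, feature_2, class_label); None marks a missing value.
rows = [
    (5.1, 3.5, "churn"),
    (5.1, 3.5, "stay"),    # contradicting: identical features, opposing classes
    (4.9, None, "churn"),  # missing variable
    (6.0, 2.9, "stay"),
]

# Red flag 1: rows with missing values.
missing = [i for i, row in enumerate(rows) if None in row]

# Red flag 2: identical feature vectors labelled with different classes.
seen = {}
contradictions = []
for i, (*features, label) in enumerate(rows):
    key = tuple(features)
    if key in seen and seen[key] != label:
        contradictions.append(i)
    seen.setdefault(key, label)

print(missing, contradictions)  # [2] [1]
```

Checks like these are the starting point of the data-cleaning work discussed in the preparation stage below; flagged rows are then fixed, imputed or dropped.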
Types of data that can be analysed include financial statements, customer and market demand data, supply chain and manufacturing data, text corpora, video and audio, image datasets, as well as time series, logs and signals.
Some types of data are a lot more costly and time-consuming to collect and label properly than others; the process can take even longer than the modelling itself. So, you need to understand how much cost is involved, how much effort is needed and what outcome you can expect, as well as your potential ROI before you make a hefty investment in the project.
5. Data preparation and preprocessing
Once you’ve established your goals, gained a clear understanding of the data needed and acquired the data, you can move on to data preprocessing. The best method for this depends on the nature of the data you have: there are, for example, different time and cost ramifications for text and image data.
It’s a pivotal stage, and your data scientists need to tread carefully when they’re assessing data quality. If there are data values missing and your data scientists use a statistical approach to fill in the gaps, it could ultimately compromise the quality of your modelling results. Your data scientists should be able to evaluate data completeness and accuracy, spot noisy data and ask the right questions to fill any gaps, but it’s essential to engage domain experts for consultancy.
Data acquisition is usually done through an Extract, Transform and Load (ETL) pipeline.
The ETL (Extract, Transform and Load) pipeline
ETL is a process of data integration that includes three steps that combine information from various sources. The ETL approach is usually applied to create a data warehouse. The information is extracted from a source, transformed into a specific format for further analysis and loaded into a data warehouse.
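The three ETL steps can be sketched end to end in a few lines. The example below extracts hypothetical order data from a CSV source, transforms it (type casting plus a currency conversion at an assumed fixed rate), and loads it into an in-memory SQLite table standing in for the warehouse; all names, figures and the exchange rate are invented for illustration.

```python
import csv
import io
import sqlite3

# Extract: read raw order data from a source (a CSV export, hypothetical here).
source = io.StringIO("order_id,amount_eur\n1,19.99\n2,5.00\n3,42.50\n")
rows = list(csv.DictReader(source))

# Transform: cast types and convert to the warehouse's reporting currency.
EUR_TO_USD = 1.10  # assumed fixed rate, for illustration only
transformed = [(int(r["order_id"]), round(float(r["amount_eur"]) * EUR_TO_USD, 2))
               for r in rows]

# Load: insert into the "warehouse" (an in-memory SQLite table here).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (order_id INTEGER, amount_usd REAL)")
db.executemany("INSERT INTO orders VALUES (?, ?)", transformed)

total = db.execute("SELECT SUM(amount_usd) FROM orders").fetchone()[0]
print(round(total, 2))  # 74.24
```

A production pipeline would swap in real sources, a real warehouse and an orchestration tool, but the extract-transform-load shape stays the same.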
The main purpose of data preprocessing is to transform information from images, audio, logs and other sources into numerical, normalised and scaled values. Another aim of data preparation is to cleanse the information. It’s possible that your data is usable but simply serves no outlined purpose. In such cases, 70-80% of total modelling time may be assigned to cleansing the data or replacing samples that are missing or contradictory.
In many situations, you may need additional feature extraction from your data (like calculating the square from the room width and length for the rent price estimation).
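The room-area example above can be sketched directly; the listings below are hypothetical, but they show how a derived feature is computed from raw measurements before modelling.

```python
# Hypothetical rental listings with raw room measurements.
listings = [
    {"width_m": 4.0, "length_m": 5.0, "rent": 900},
    {"width_m": 3.0, "length_m": 6.0, "rent": 850},
]

for listing in listings:
    # Derived feature: a rent-price model may learn more easily from
    # area than from width and length as separate inputs.
    listing["area_m2"] = listing["width_m"] * listing["length_m"]

print([l["area_m2"] for l in listings])  # [20.0, 18.0]
```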
Proper preparation from kick-off will ensure that your data science project gets off on the right foot, with the right goals in mind. An initial data assessment can outline how to prepare your data for further modelling.
We advise that you start with proof-of-concept (PoC) development, where you can validate initial ideas before your team starts pre-testing on your real-world data. After you’ve validated your ideas with a PoC, you can safely proceed to production model creation.
6. Data modelling

Define the modelling technique
Even though you may have chosen a tool at the business understanding stage, the modelling stage begins with choosing the specific modelling technique you’ll use. At this stage, you generate a number of models that are set up, built and can be trained. ML models (linear regression, KNN, ensembles, random forests and so on) and deep learning models (RNNs, LSTMs and GANs) are part of this step.
Come up with a test design
Before model creation, a testing method or system should be developed to review the model’s quality and validity. Take classification as a data mining task: error rates can be used as quality measures, so you can separate the dataset into training, validation and test sets, build the model using the training set, and make a quality assessment based on the separate test set (the validation set is used for model/approach selection, not for the final error/accuracy measurement).
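The split described above can be sketched in a few lines; the 70/15/15 proportions and the dataset of 100 samples are arbitrary choices for illustration.

```python
import random

# Hypothetical dataset of 100 labelled samples (indices stand in for rows).
random.seed(42)                      # fixed seed so the split is reproducible
samples = list(range(100))
random.shuffle(samples)

train      = samples[:70]            # fit candidate models here
validation = samples[70:85]          # compare models / tune hyperparameters here
test       = samples[85:]            # final, untouched accuracy measurement

print(len(train), len(validation), len(test))  # 70 15 15
```

The key discipline is that the test set is touched exactly once, for the final error measurement, so it never leaks into model or approach selection.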
Build a model
To develop one or more models, run the modelling tool on the prepared dataset.

- Parameter settings: modelling tools usually allow the adjustment of a wide range of parameters. Record the chosen parameter values, together with the justification for each choice.
- Models: the actual models produced by the modelling tool, as distinct from the model report.
- Model descriptions: outline the resulting models, report on their interpretation and detail any issues with their meaning.
7. Model evaluation
Model selection during the prototyping phase
To assess the model, leverage your domain knowledge, criteria of data mining success and desired test design. After evaluating the success of the modelling application, work together with business analysts and domain experts to review the data mining results in the business context.
Include business objectives and business success criteria at this point. Usually, data mining projects implement a technique several times, and data mining results are obtained by many different methods.
- Model assessment: sum up task results, evaluate the accuracy of the generated models and rank them in relation to each other.
- Revised parameter settings: building on the evaluation of the model, adjust parameter settings for the next run. Keep modifying parameters until you find the best model(s). Make sure to document modifications and assessments.
Here are some methods used by data scientists to check a model’s accuracy:
- Lift and gain charts: used in campaign targeting problems to determine the target customers for a campaign. They also estimate the response level you can get from a new target base.
- ROC curve: a performance measurement that plots the true positive rate against the false positive rate.
- Gini coefficient: measures the inequality among values of a variable.
- Cross-validation: dividing data into two or three parts, where the first is used for model training, the second for approach selection, and the third, the test set, for the final model performance measurement.
- Confusion matrix: a table that compares each class’s number of predictions to its number of instances. It can help to define the model’s accuracy, true positives, false positives, sensitivity and specificity.
- Root mean squared error: the average magnitude of the error made; most used in regression techniques to estimate the average amount of wrong predictions.
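The confusion matrix and the metrics derived from it can be sketched directly. The labels and predictions below are hypothetical; the block builds a binary confusion matrix and computes accuracy, sensitivity and specificity from it.

```python
from collections import Counter

# Hypothetical true labels vs. model predictions for a binary task.
y_true = ["spam", "spam", "ham", "ham", "spam", "ham"]
y_pred = ["spam", "ham",  "ham", "spam", "spam", "ham"]

matrix = Counter(zip(y_true, y_pred))   # (actual, predicted) -> count
tp = matrix[("spam", "spam")]           # true positives
fp = matrix[("ham", "spam")]            # false positives
fn = matrix[("spam", "ham")]            # false negatives
tn = matrix[("ham", "ham")]             # true negatives

accuracy    = (tp + tn) / len(y_true)
sensitivity = tp / (tp + fn)            # true positive rate (recall)
specificity = tn / (tn + fp)            # true negative rate

print(tp, fp, fn, tn)  # 2 1 1 2
```

Sweeping a decision threshold and re-computing these rates at each point is exactly what produces the ROC curve mentioned above.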
The assessment method should fit your business objectives. When you go back to preprocessing to check your approach, you can use different preprocessing techniques, extract other features and then return to the modelling stage. You can also perform factor analysis to check how your model reacts to different samples.
8. Deployment: Real-world integration and model monitoring
When the model has passed the validation stage, and you and your stakeholders are 100% happy with the results, only then can you move on to full-scale development: integrating the model within your real production environment. The role of engineers such as DevOps, MLOps and database engineers is very important at this stage.
The model consists of a set of scripts that process data from databases, data lakes and file systems (CSV, XLS, URLs), using APIs, ports, sockets or other sources. You’ll need some technical expertise to find your way around the models.
Alternatively, you could have a custom user interface built, or have the model integrated with your existing systems for convenience and ease of use. This is easily done via microservices and other methods of integration. Once validation and deployment are complete, your data science team and business leaders need to step back and assess the project’s overall success.
9. Data model monitoring and maintenance
A data science project doesn’t end with the deployment stage; the maintenance step comes next. Data changes from day to day, so a monitoring system is needed to track the model’s performance over time.
When the model’s performance degrades, monitoring systems can indicate whether a failure needs to be handled, whether the model should be retrained, or whether a new model should be implemented. The main purpose of maintenance is to ensure the system’s full functionality and optimal performance until the end of its working life.
10. Data model disposition
Data disposition is the last stage in the data science project life cycle, consisting of either data or model reuse/repurposing or data/model destruction. If the data is reused or repurposed, the data science project life cycle becomes circular. Data reuse means using the same information several times for the same purpose, while data repurposing means using the same data to serve more than one purpose.
Data or model destruction, on the other hand, means complete information removal. To erase the information you can, among other things, overwrite it or physically destroy the storage medium. Data destruction is critical to protecting privacy, and failure to delete information may lead to breaches and compliance problems, among other issues.
AI will keep shaping the establishment of new business, financial and operating models in 2021 and beyond. The investments of world-leading companies will affect the global economy and its workforce and are likely to define new winners and losers.
The lack of AI-specific skills remains a primary obstacle to adoption in the majority of organisations. In the survey by O’Reilly, around 58% of respondents mentioned the shortage of ML modellers and data scientists, among other skill gaps within their organisations.
Source: AI adoption in the enterprise 2020
Originally published at the ELEKS Labs blog.
What Is Data Mining?
Data mining is a computer-assisted technique used in analytics to process and explore large data sets. With data mining tools and methods, organizations can discover hidden patterns and relationships in their data. Data mining transforms raw data into practical knowledge. Companies use this knowledge to solve problems, analyze the future impact of business decisions, and increase their profit margins.
What does the term data mining mean?
“Data mining” is a misnomer because the goal of data mining is not to extract or mine the data itself. Instead, a large amount of data is already present, and data mining extracts meaning or valuable knowledge from it. The typical process of data collection, storage, analysis, and mining is outlined below.
- Data collection is capturing data from different sources like customer feedback, payments, and purchase orders.
- Data warehousing is the process of storing that data in a large database or data warehouse .
- Data analytics is further processing, storing, and analyzing the data using complex software and algorithms.
- Data mining is a branch of data analytics or an analytics strategy used to find hidden or previously unknown patterns in data.
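The "find hidden patterns" idea in the last step can be shown in miniature: the sketch below counts which product pairs are most often bought together in a handful of hypothetical transactions, a toy version of the market-basket analysis retailers run at scale.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transaction data: each set is one customer's basket.
transactions = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
]

pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):   # every 2-item combination
        pair_counts[pair] += 1

top_pair, count = pair_counts.most_common(1)[0]
print(top_pair, count)  # ('bread', 'butter') 2
```

Production systems use dedicated association-rule algorithms over millions of transactions, but the output has the same shape: co-occurrence patterns that were not visible in the raw data.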
Why is data mining important?
Data mining is a crucial part of any successful analytics initiative. Businesses can use the knowledge discovery process to increase customer trust, find new sources of revenue, and keep customers coming back. Effective data mining aids in various aspects of business planning and operations management. Below are some examples of how different industries use data mining.
Telecom, media, and technology
High-competition verticals like telecom, media, and technology use data mining to improve customer service by finding patterns in customer behavior. For example, a company could analyze bandwidth usage patterns and provide customized service upgrades or recommendations.
Banking and insurance
Financial services can use data mining applications to solve complex fraud, compliance, risk management, and customer attrition problems. For example, insurance companies can discover optimal product pricing by comparing past product performance with competitor pricing.
Education

Education providers can use data mining algorithms to test students, customize lessons, and gamify learning. Unified, data-driven views of student progress can help educators see what students need and support them better.
Manufacturing

Manufacturing services can use data mining techniques to provide real-time and predictive analytics for overall equipment effectiveness, service levels, product quality, and supply chain efficiency. For example, manufacturers can use historical data to predict the wear of production machinery and anticipate maintenance. As a result, they can optimize production schedules and reduce downtime.
Retail

Retail companies have large customer databases with raw data about customer purchase behavior. Data mining can process this data to derive relevant insights for marketing campaigns and sales forecasts. Through more accurate data models, retail companies can optimize sales and logistics for increased customer satisfaction. For example, data mining can reveal popular seasonal products that can be stocked in advance to avoid last-minute shortages.
How does data mining work?
The Cross-Industry Standard Process for Data Mining (CRISP-DM) is an excellent guideline for starting the data mining process. CRISP-DM is both a methodology and a process model that is industry, tool, and application neutral.
- As a methodology, it describes the typical phases in a data mining project, outlines the tasks involved in each stage, and explains the relationships between these tasks.
- As a process model, CRISP-DM provides an overview of the data mining life cycle.
What are the six phases of the data mining process?
Using the flexible CRISP-DM phases, data teams can move back and forth between stages as needed. Software technologies can also perform or support some of these tasks.
1. Business understanding
The data scientist or data miner starts by identifying project objectives and scope. They collaborate with business stakeholders to identify the following information.
- Problems that need to be addressed
- Project constraints or limitations
- The business impact of potential solutions
They then use this information to define data mining goals and identify the resources required for knowledge discovery.
2. Data understanding
Once they understand the business problem, data scientists begin preliminary analysis of the data. They gather data sets from various sources, obtain access rights, and prepare a data description report. The report includes the data types, quantity, and hardware and software requirements for data processing. Once the business has approved their plan, they begin exploring and verifying the data. They manipulate the data using basic statistical techniques, assess the data quality, and choose a final data set for the next stage.
3. Data preparation
Data miners spend the most time on this phase because data mining software requires high-quality data. Business processes collect and store data for reasons other than mining, and data miners must refine it before using it for modeling. Data preparation involves the following processes.
Clean the data
For example, handle missing data, data errors, default values, and data corrections.
Integrate the data
For example, combine two disparate data sets to get the final target data set.
Format the data
For example, convert data types or configure data for the specific mining technology being used.
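The three preparation steps above can be sketched in plain Python; the customer records, field names, and the `-1` sentinel are all invented for illustration:

```python
# Hypothetical raw extract with typical defects (field names invented).
raw = [
    {"id": 1, "age": "34", "spend": 120.0},
    {"id": 2, "age": None, "spend": 80.0},   # missing age
    {"id": 1, "age": "34", "spend": 120.0},  # duplicate record
]
regions = {1: "EU", 2: "US"}                 # second source to integrate

# Clean: drop duplicates and fill missing ages with a sentinel default.
seen, cleaned = set(), []
for row in raw:
    if row["id"] in seen:
        continue
    seen.add(row["id"])
    # Format: convert the age string to an int for the mining tool.
    row["age"] = int(row["age"]) if row["age"] is not None else -1
    cleaned.append(row)

# Integrate: join the region lookup onto each record.
for row in cleaned:
    row["region"] = regions[row["id"]]
```

Real pipelines do the same steps with dedicated tooling, but the order (clean, integrate, format) is the same.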
4. Data modeling
Data miners input the prepared data into the data mining software and study the results. To do this, they can choose from multiple data mining techniques and tools. They must also write tests to assess the quality of data mining results. To model the data, data scientists can:
- Train the machine learning (ML) models on smaller data sets with known outcomes
- Use the model to analyze unknown data sets further
- Adjust and reconfigure the data mining software until the results are satisfactory
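As a rough illustration of this train-then-apply loop, here is a toy 1-nearest-neighbor model; the points, labels, and holdout split are all made up:

```python
# Toy 1-nearest-neighbor model: train on a small labeled set, apply to unknowns.
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((5.0, 5.2), "B"), ((4.8, 5.0), "B")]

def predict(point):
    # The label of the closest training example wins.
    dist = lambda p, q: (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    return min(train, key=lambda ex: dist(ex[0], point))[1]

# Measure quality on held-out points with known outcomes.
holdout = [((1.1, 0.9), "A"), ((5.1, 4.9), "B")]
accuracy = sum(predict(x) == y for x, y in holdout) / len(holdout)
```

If the accuracy were unsatisfactory, the "adjust and reconfigure" step would mean changing the model or its settings and re-measuring.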
5. Evaluation
After creating the models, data miners start measuring them against the original business goals. They share the results with business analysts and collect feedback. The model might answer the original question well or show new and previously unknown patterns. Data miners can change the model, adjust the business goal, or revisit the data, depending on the business feedback. Continual evaluation, feedback, and modification are part of the knowledge discovery process.
6. Deployment
During deployment, other stakeholders use the working model to generate business intelligence. The data scientist plans the deployment process, which includes teaching others about the model functions, continually monitoring, and maintaining the data mining application. Business analysts use the application to create reports for management, share results with customers, and improve business processes.
What are the techniques for data mining?
Data mining techniques draw from various fields of learning that overlap, including statistical analysis, machine learning (ML), and mathematics. Some examples are given below.
Association rule mining
Association rule mining is the process of finding relationships between two different, seemingly unrelated data sets. If-then statements demonstrate the probability of a relationship between two data points. Data scientists measure result accuracy using support and confidence criteria. Support measures how frequently the related elements appear in the data set, while confidence measures how often the if-then statement proves accurate.
For example, when customers buy an item, they also often buy a second related item. Retailers can use association mining on past purchase data to identify a new customer's interest. They use data mining results to populate the recommended sections of online stores.
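A minimal sketch of computing support and confidence from (invented) basket data:

```python
# Toy basket data to show the support and confidence criteria.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"butter"},
]

def support(itemset):
    # Fraction of baskets containing every item in the set.
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent):
    # How often "antecedent => consequent" holds when the antecedent appears.
    return support(antecedent | consequent) / support(antecedent)

s = support({"bread", "butter"})        # both items together: 2 of 4 baskets
c = confidence({"bread"}, {"butter"})   # bread buyers who also buy butter
```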
Classification
Classification is a complex data mining technique that trains the ML algorithm to sort data into distinct categories. It uses statistical methods like decision trees and nearest-neighbor to identify the category. In all these methods, the algorithm is preprogrammed with known data classifications to guess the type of a new data element.
For example, analysts can train the data mining software by using labeled images of apples and mangoes. With some accuracy, the software can then predict if a new picture is an apple, mango, or other fruit.
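A toy version of this idea, assuming a single made-up width/height-ratio feature and a one-level decision stump rather than a full tree:

```python
# One-level decision stump trained on labeled fruit examples.
# The width/height ratio feature and its values are invented for illustration.
labeled = [(0.95, "apple"), (0.98, "apple"), (0.60, "mango"), (0.65, "mango")]

def train_stump(data):
    # Try every observed ratio as a cut-off; keep the one with fewest errors.
    best = None
    for t in sorted(r for r, _ in data):
        errors = sum((r >= t) != (lab == "apple") for r, lab in data)
        if best is None or errors < best[1]:
            best = (t, errors)
    return best[0]

threshold = train_stump(labeled)
classify = lambda ratio: "apple" if ratio >= threshold else "mango"
```

Real image classifiers learn many such decision boundaries from pixel-level features, but the supervised principle is the same.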
Clustering
Clustering is grouping multiple data points together based on their similarities. It is different from classification because it cannot distinguish the data by specific category but can find patterns in their similarities. The data mining result is a set of clusters where each collection is distinct from other groups, but the objects in each cluster are similar in some way.
For example, cluster analysis can help with market research when working with multivariate data from surveys. Market researchers use cluster analysis to divide consumers into market segments and better understand the relationships between different groups.
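A minimal one-dimensional k-means sketch, segmenting hypothetical customers by annual spend:

```python
# One-dimensional k-means sketch: segment customers by annual spend.
# Spend figures are invented for illustration.
spend = [100, 120, 110, 900, 950, 880]
centers = [min(spend), max(spend)]        # naive initialization

for _ in range(10):                       # a few Lloyd iterations
    clusters = [[], []]
    for x in spend:
        # Boolean index: 0 if closer to centers[0], 1 otherwise.
        clusters[abs(x - centers[0]) > abs(x - centers[1])].append(x)
    centers = [sum(c) / len(c) for c in clusters]
```

The two resulting centers would correspond to a "low-spend" and a "high-spend" market segment.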
Sequence and path analysis
Data mining software can also look for patterns in which a particular set of events or values leads to later ones. It can recognize some variation in data that happens at regular intervals or in the ebb and flow of data points over time.
For example, a business might use path analysis to discover that certain product sales spike just before the holidays or to notice that warmer weather brings more people to its website.
What are the types of data mining?
Depending on the data and the purpose of mining, data mining can have various branches or specializations. Let's look at some of them below.
Process mining
Process mining is a branch of data mining that aims to discover, monitor, and improve business processes. It extracts knowledge from event logs that are available in information systems. It helps organizations see and understand what's happening in these processes from day to day.
For example, e-commerce businesses have many processes, like procurement, sales, payments, collection, and shipping. By mining their procurement data logs, they might see that their supplier delivery reliability is 54% or that 12% of suppliers are consistently delivering early. They can use this information to optimize their supplier relationships.
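A minimal sketch of deriving such reliability figures from an event log; the order records are invented:

```python
# Hypothetical procurement event log: (order_id, promised_day, delivered_day).
log = [
    ("PO-1", 10, 10),  # on time
    ("PO-2", 10, 12),  # late
    ("PO-3", 15, 14),  # early
    ("PO-4", 20, 25),  # late
]

# Delivery reliability: share of orders delivered on or before the promised day.
on_time = sum(d <= p for _, p, d in log) / len(log)
early   = sum(d < p for _, p, d in log) / len(log)
```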
Text mining
Text mining or text data mining is using data mining software to read and comprehend text. Data scientists use text mining to automate knowledge discovery in written resources like websites, books, emails, reviews, and articles.
For example, a digital media company could use text mining to automatically read comments on its online videos and classify audience reviews as positive or negative.
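A deliberately simple lexicon-based sketch of this classification; real text mining would train a model on labeled comments, and the word lists here are invented:

```python
# Minimal lexicon-based sentiment sketch; a production system would train
# an ML model on labeled examples instead of using fixed word lists.
POSITIVE = {"great", "love", "helpful"}
NEGATIVE = {"boring", "bad", "waste"}

def classify(comment):
    words = set(comment.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return "positive" if score >= 0 else "negative"

labels = [classify(c) for c in ["Great video, love it", "Boring and bad"]]
```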
Predictive data mining
Predictive data mining uses business intelligence to predict trends. It helps business leaders study the impact of their decisions on the company’s future and make effective choices.
For example, a company might look at past product-return data to design a warranty scheme that does not lead to losses. Using predictive mining, it can estimate the number of returns in the coming year and create a one-year warranty plan that factors the expected loss into the product price.
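One way to sketch such a projection is an ordinary least-squares line over past yearly return counts; the figures are invented:

```python
# Fit a straight line to past yearly return counts and project next year.
years   = [1, 2, 3, 4]
returns = [100, 120, 135, 160]

n = len(years)
mean_x, mean_y = sum(years) / n, sum(returns) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(years, returns))
         / sum((x - mean_x) ** 2 for x in years))
intercept = mean_y - slope * mean_x

forecast = intercept + slope * 5   # expected returns in year 5
```

The forecast feeds directly into the warranty pricing decision described above.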
How can AWS help with data mining?
Amazon SageMaker is a leading data mining software platform. It helps data miners and developers prepare, build, train, and deploy high-quality machine learning (ML) models. It includes several tools for the data mining process.
- Amazon SageMaker Data Wrangler reduces the time to aggregate and prepare data for mining from weeks to minutes.
- Amazon SageMaker Studio provides a single, web-based visual interface where data scientists can perform ML development steps, which improves the data science team’s productivity. SageMaker Studio gives complete access, control, and insight into each step as data scientists build, train, and deploy models.
- Distributed training libraries use partitioning algorithms to automatically split large models and training data sets for modeling.
- Amazon SageMaker Debugger optimizes ML models by capturing real-time training metrics and sending alerts when anomalies are detected. This helps fix inaccurate model predictions quickly.
- The 1 Percent Rule
- (Absolute|True) Zero
- (Parameters | Model) (Accuracy | Precision | Fit | Performance) Metrics
- Adjusted R^2
- Akaike information criterion (AIC)
- (Anomaly|outlier) Detection
- Apriori algorithm
- Association (Rules Function|Model) - Market Basket Analysis
- Attribute (Importance|Selection) - Affinity Analysis
- Area under the curve (AUC)
- Automatic Discovery
- Bootstrap aggregating (bagging)
- (Base rate fallacy|Bonferroni's principle)
- (Baseline|Naive) classification (Zero R)
- Bayes’ Theorem (Probability)
- Benford's law (frequency distribution of digits)
- Best Subset Selection Regression
- Bias-variance trade-off (between overfitting and underfitting)
- Bias (Sampling error)
- Bayesian Information Criterion (BIC)
- Bimodal Distribution
- Binary logistic regression
- Mathematics - Combination (Binomial coefficient|n choose k)
- (Probability|Statistics) - Binomial Distribution
- Data Mining, Book
- (Boosting|Gradient Boosting|Boosting trees)
- Decision boundary Visualization
- (C4.5|J48) algorithm
Documentation / Reference
- Causation - Causality (Cause and Effect) Relationship
- Cumulative Distribution Function (CDF)
- Centering Continuous Predictors
- Central limit theorem (CLT)
- centroid (center of gravity)
- Data-Science - Cheatsheet
- (Class|Category|Label) Target
- (Classifier|Classification Function)
- Clustering (Function|Model)
- Coin Flipping
- (Prediction|Recommender System) - Collaborative filtering
- Competitions (Kaggle and others)
- Confidence Interval
- Statistics - (Confidence|likelihood) (Prediction probabilities|Probability classification)
- Confounding (factor|variable) - (Confound|Confounder)
- Confusion Matrix
- Content Analysis and Acquisition
- Continuous Variable
- Correlation (Coefficient analysis)
- Cosine Similarity (Measure of Angle)
- Mallow's Cp
- Cross Product (of X and Y) (CP|SP)
- (Statistics|Data Mining) - (K-Fold) Cross-validation (rotation estimation)
- (Periodicity|Periodic phenomena|Cycle)
- (Data|Knowledge) Discovery - Statistical Learning
- Data (Preparation | Wrangling | Munging)
- Data Product
- Data - Science
- Data Scientist
- Decision Tree (DT) Algorithm
- Decision Stump
- Deep Learning (Network)
- (Degree|Level) of confidence
- Degree of freedom (df)
- (dependent|paired sample) t-test
- Math - Derivative (Sensitivity to Change, Differentiation)
- Design Matrix (X)
- Deviation Score (for one observation)
- Rolling a die (many dice)
- (Dimension|Feature) (Reduction)
- Dimensionality (number of variable, parameter) (P)
- (Data|Text) Mining - Word-sense disambiguation (WSD)
- (Discretizing|binning) (bin)
- Quadratic discriminant analysis (QDA)
- Discriminant analysis
- (Discriminative|conditional) models
- Statistics / Distribution - (Function)
- Dummy (Coding|Variable) - One-hot-encoding (OHE)
- Effect Size
- Effects (between predictor variable)
- Elastic Net Model
- Ensemble Learning (meta set)
- Entropy (Information Gain)
- (Error|misclassification) Rate - false (positives|negatives)
- Prediction Error (Training versus Test)
- (Estimator|Point Estimate) - Predicted (Score|Target|Outcome| )
- Exponential Distribution
- Face Recognition
- Factor Analysis
- (Factor Variable|Qualitative Predictor)
- Factorial Anova
- Feature Engineering
- (Feature|Attribute) Extraction Function
- Feature Hashing
- (Attribute|Feature) (Selection|Importance)
- Fraud Detection
- Frequency Distribution
- (Frequent itemsets|co-occurring items)
- Data Model - Fudge factor
- Fuzzy Logic (Partial Truth)
- Galton board
- Generalized additive model (GAM)
- Statistics / Gaussian function ( )
- Gaussian processes (modelling probability distributions over functions)
- Generalized Boosted Regression Models
- Generative Model
- Getting Started
- Generalized Linear Models (GLM) - Extensions of the Linear Model
- (Stochastic) Gradient descent (SGD)
- Grouping (Classification)
- Hierarchical Clustering
- High Dimension (Curse of Dimensionality)
- Data Science - History
- Hypothesis (Tests|Testing)
- ID3 Algorithm
- Intrusion detection systems (IDS) / Intrusion Prevention / Misuse
- Image classification
- independent t-test
- Statistical - Inference
- Information Gain
- Information Retrieval
- (Interaction|Synergy) effect
- Intercept - Regression (coefficient|constant)
- Model Interpretation
- (Interval|Delta) (Measurement)
- Java API for data mining (JDM)
- k-Means Clustering algorithm
- K-Nearest Neighbors (KNN) algorithm - Instance based learning
- Knots (Cut points)
- Kurtosis (Distribution Tail extremity)
- Statistical Learning - Lasso
- Standard Least Squares Fit (Gaussian linear model)
- Leptokurtic distribution
- (Life cycle|Project|Data Pipeline)
- Fisher (Multiple Linear Discriminant Analysis|multi-variant Gaussian)
- Statistical Learning - Simple Linear Discriminant Analysis (LDA)
- Linear (Regression|Model)
- (Linear spline|Piecewise linear function)
- Little r - (Pearson product-moment Correlation coefficient)
- LOcal (Weighted) regrESSion (LOESS|LOWESS)
- Global vs Local
- log-likelihood function (cross-entropy)
- Logistic regression (Classification Algorithm)
- (Logit|Logistic) (Function|Transformation)
- Loss functions (Incorrect predictions penalty)
- Data Science - (Kalman Filtering|Linear quadratic estimation (LQE))
- Machine Learning
- Main Effect
- Probability mass function (PMF)
- Maximum Entropy Algorithm
- Maximum likelihood
- (Missing Value|Not Available) NA
- Model Size (d)
- Model vs Expert
- Moderator Variable (Z) - Moderation
- (Average|Mean) Squared (MS) prediction error (MSE)
- Multi-class (classification|problem)
- Multi-variant logistic regression
- (Multiclass Logistic|multinomial) Regression
- Multidimensional scaling ( similarity of individual cases in a dataset)
- Multiple Linear Regression
- Naive Bayes (NB)
- (Probabilistic?) Neural Network (PNN)
- (No Predictor|Mean|Null) Model
- Noise (Unwanted variation)
- Multi-response linear regression (Linear Decision trees)
- Non-linear (effect|function|model)
- Non-Negative Matrix Factorization (NMF) Algorithm
- (Normal|Gaussian) Distribution - Bell Curve
- Orthogonal Partitioning Clustering (O-Cluster or OC) algorithm
- Odds (Ratio)
- (One|Simple) Rule - (One Level Decision Tree)
- Outliers Cases
- (Overfitting|Overtraining|Robust|Generalization) (Underfitting)
- Data Science - Over-generalization
- (Paretian|Power law) distribution
- Pareto ( Principle | Distribution )
- Pascal Triangle
- What is a Pattern ?
- Principal Component (Analysis|Regression) (PCA|PCR)
- (Probability) Density Function (PDF)
- Mathematics - Permutation (Ordered Combination)
- Piecewise polynomials
- Partial least squares (PLS)
- Predictive Model Markup Language (PMML)
- Poisson Distribution
- (Global) Polynomial Regression (Degree)
- Population Parameter
- Post-hoc test
- Power of a test
- (Machine|Statistical) Learning - (Predictor|Feature|Regressor|Characteristic) - (Independent|Explanatory) Variable (X)
- Probability (of an event) / Likelihood
- Probit Regression (probability on binary problem)
- Process control (SPC)
- Pruning (a decision tree, decision rules)
- R-squared ( |Coefficient of determination) for Model Accuracy
- Random forest
- Random Variable (Random quantity|Aleatory variable|Stochastic variable)
- (Fraction|Ratio|Percentage|Share) (Variable|Measurement)
- (Regression Coefficient|Weight|Slope) (B)
- Assumptions underlying correlation and regression analysis (Never trust summary statistics alone)
- (Machine learning|Inverse problems) - Regularization
- Reinforcement learning
- Sampling - Sampling (With|without) replacement (WR|WOR)
- (Residual|Error Term|Prediction error|Deviation) (e| )
- Result Considerations
- Ridge regression
- Root Mean Square (RMS)
- Root mean squared (Error|Deviation) (RMSE|RMSD)
- ROC Plot and Area under the curve (AUC)
- Rote Classifier
- Residual sum of Squares (RSS) = Squared loss ?
- (Decision) Rule
- Sampling Distribution
- Sampling Error
- (Scales of measurement|Type of variables)
- Scoring (Applying)
- (Shrinkage|Regularization) of Regression Coefficients
- Signal (Wanted Variation)
- Significance level
- (Significance | Significant) Effect
- (Univariate|Simple) Logistic regression
- (Univariate|Simple|Basic) Linear Regression
- Simple Effect
- Skew (-ed Distribution|Variable)
- ( Spread | Variability ) of a sample
- Standard Deviation (SD|s| |RMS width)
- Standard Error (SE)
- Forward and Backward Stepwise (Selection|Regression)
- (Supervised|Directed) Learning ( Training ) (Problem)
- Support Vector Machines (SVM) algorithm
- Singular Value Decomposition (SVD)
- (Student's) t-test (Mean Comparison)
- (Machine|Statistical) Learning - (Target|Learned|Outcome|Dependent|Response) (Attribute|Variable) (Y|DV)
- (Test|Expected|Generalization) Error
- (Threshold|Cut-off) of binary classification
- Titanic Data Set
- Training Error
- Training (Data|Set)
- Nested (Transactional|Historical) Data
- Treatments (Combination of factor level)
- True score (Classical test theory)
- (True Function|Truth)
- (Total) Sum of the square (TSS|SS)
- Tuning Parameter
- (two class|binary) classification problem (yes/no, false/true)
- Statistical Learning - Two-fold validation
- Data - Uncertainty
- Uniform Distribution (platykurtic)
- Unsupervised Learning ( Mining )
- Resampling through Random Percentage Split
- Validity (Valid Measures)
- (Variance|Dispersion|Mean Square) (MS)
- Variation (Change?)
- Probability and Visualization
- Statistics vs (Machine Learning|Data Mining)
- Random Walk
- (Golf|Weather) Data Set
- Z Score (Zero Mean) or Standard Score
Data Mining - (Life cycle|Project|Data Pipeline)
Covered below: related articles, observation versus perturbation, data preparation, why a model is dynamic, and pitfalls/pratfalls.
Data mining is an experimental science.
Data mining reveals correlation, not causation.
- Good features: use a simple algorithm (for example, linear regression).
- No meaningful features: use a more “intelligent” algorithm, which tends to overfit.
- Decide which model to use accordingly.
From data to information (the patterns, or expectations, that underlie the data).
Any data scientist worth their salary will say you should start with a question, NOT the data, @JakePorway
Most #bigdata problems can be addressed by proper sampling/filtering and running models on a single (perhaps large) machine … Chris Volinsky
(Statistics|Probability|Machine Learning|Data Mining|Data and Knowledge Discovery|Pattern Recognition|Data Science|Data Analysis)
The only way to find out what will happen when a complex system is disturbed is to disturb the system, not merely to observe it passively. (Fred Mosteller and John Tukey, paraphrasing George Box)
In other words, if you want to make a causal statement about a predictor for an outcome, you actually have to be able to take the system and perturb that particular predictor keeping the other ones fixed.
That will allow you to make a causal statement about a predictor variable and its effect on the outcome. It is not enough simply to observe the system; observational data alone cannot establish causality. In other words, to know what happens when a complex system is perturbed, the system must be perturbed, not only observed.
- Google Cloud Dataprep, an intelligent, fully managed cloud service (built in collaboration with Trifacta) that visually explores, cleans, and prepares structured and unstructured data for analysis or for training machine-learning models.
- Define the question of interest, Identify the problem
- (Get|Collect) the data
- Data Preparation: prepare the data (integrate, transform, clean, filter, aggregate) (Data Processing|Data Integration)
- (Explore|Interact) with the data (and always visualize the data to understand its distribution; see Anscombe's quartet to understand why)
- Tip: train a model to distinguish between your training set and unlabeled data; if it works, your training data may be incomplete! (Jake VanderPlas)
- (Build|Fit) a model
- Evaluation: determine whether the classifier is a good representation of the data.
- Communicate the results
- Make the analysis reproducible
- Build the representation that maximize accuracy
- How to make the evaluation more efficient by reducing the search space.
- Choose a classifier with a knowledge representation (how the data is classified - decision tree, rule, …)
Learning is iterative:
- Apply Model to data
- Observe Errors
- Update Model
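This apply/observe/update loop can be sketched as a few gradient-descent steps on a one-parameter model; the data and learning rate are invented:

```python
# Learning as an iterative loop: apply the model, observe errors, update it.
# Made-up sample roughly following y = 2x; the model is y_hat = w * x.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]
w = 0.0

for _ in range(200):
    # Apply model to data and observe errors.
    errs = [w * x - y for x, y in data]
    # Update model: one gradient-descent step on mean squared error.
    grad = sum(2 * e * x for (x, _), e in zip(data, errs)) / len(data)
    w -= 0.05 * grad
```

After enough iterations the parameter settles near the slope that minimizes the observed errors.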
- Data Mining - Result Considerations
- Ask a question. “Tell me something cool about the data” is not enough!
- Collect Data
- Define New Features
- Center (Normalize) (Standardize) : Transform numeric attributes to have zero mean (or into a given numeric range) (or to have zero mean and unit variance)
- Discretize: Discretize numeric attributes to have nominal values
- PrincipalComponents (PCA) : Perform a principal components analysis/transformation of the data
- RemoveUseless: Remove attributes that do not vary at all, or vary too much
- TimeSeriesDelta, TimeSeriesTranslate: Replace attribute values with successive differences between this instance and the next
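A rough Python sketch of the Center/Standardize and Discretize filters above, with invented values:

```python
# Center/standardize one numeric attribute, then discretize it.
values = [2.0, 4.0, 6.0, 8.0]

mean = sum(values) / len(values)
var = sum((v - mean) ** 2 for v in values) / len(values)
standardized = [(v - mean) / var ** 0.5 for v in values]  # zero mean, unit variance

# Equal-width binning into nominal values "low" / "mid" / "high".
lo, hi = min(values), max(values)
bins = ["low", "mid", "high"]
width = (hi - lo) / len(bins)
discretized = [bins[min(int((v - lo) / width), len(bins) - 1)] for v in values]
```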
The phases of solving a business problem using Data Mining are as follows:
- Problem Definition in Terms of Data Mining and Business Goals
- Data Acquisition and Preparation
- Building and Evaluation of Models
For a Supervised problem:
- Optional: Machine Learning - Unsupervised Learning ( Mining )
- Statistics - Model Building (Training|Learning|Fitting)
- Statistics - Model Evaluation (Estimation|Validation|Testing)
- Data Mining - Scoring (Applying)
Cross Industry Standard Process Model for Data Mining
The Cross Industry Standard Process Model for Data Mining (CRISP-DM). From: An Oracle White Paper, February 2013, "Information Management and Big Data: A Reference Architecture".
- An analyst first builds both a business understanding and a data understanding in order to develop a testable hypothesis.
- Identify the data of interest.
- Explore the data with profiling, data quality, statistics, and visualization tools.
- Build models.
- Evaluate the models (both technically and commercially) before deploying them.
Uber's Michelangelo platform (https://eng.uber.com/michelangelo/) describes six steps:
- Manage data
- Train models
- Evaluate models
- Deploy models
- Make predictions
- Monitor predictions
When Google rolled out flu stories in Google News, people started reading about flu in the news and searching on those stories, and that skewed the results. From 2011 to 2013, Google Flu Trends overestimated the prevalence of flu (by roughly a factor of two in 2012 and 2013), and the model needed to take this new factor into account.
Google Flu Trends teaches us that the modeling process cannot be static; we must periodically revisit the process and understand what underlying factors, if any, may have changed.
- Pitfall: A hidden or unsuspected danger or difficulty
- Pratfall: A stupid and humiliating action
- Weka MOOC course unit 5
- Data science done well looks easy - and that is a big problem for data scientists
Introduction to Life Cycle of Data Science projects (Beginner Friendly)
This article was published as a part of the Data Science Blogathon
As an aspiring data scientist, you must be keen to understand how the life cycle of a data science project works so that it is easier for you to implement your own projects in a similar pattern. Today, we will walk through the step-by-step implementation of a data science project in a real-world scenario.
What is a Data Science Project Lifecycle?
In simple terms, a data science life cycle is a repeatable set of steps that you take to complete and deliver a project or product to your client. Because the projects, teams, and deployment practices differ, the life cycle varies slightly from company to company; however, most data science projects follow a broadly similar process.
To start and complete a data science project, we need to understand the various roles and responsibilities of the people involved in building and developing it. Let us take a look at who is typically involved:
Who Is Involved in the Projects:
- Business Analyst
- Data Analyst
- Data Scientists
- Data Engineer
- Data Architect
- Machine Learning Engineer
Now that we have an idea of who is involved in a typical business project, let's understand what a data science project is and how to define its life cycle in a real-world scenario, such as a fake news identifier.
Why do we need to define the Life Cycle of a data science project?
Data is the main element of any data science project. Without data, we cannot do any analysis or predict any outcome, since we would be looking at something unknown. Hence, before starting any data science project received from clients or stakeholders, we first need to understand the underlying problem statement they present. Once we understand the business problem, we gather the relevant data that will help us solve the use case. For beginners, however, many questions arise:
In what format do we need the data?
How to get the data?
What do we need to do with data?
The answers may vary from person to person. To address all these concerns, there is a predefined flow termed the Data Science Project Life Cycle. The process is fairly simple: the company first gathers data, cleans it, performs EDA to extract relevant features, and prepares the data through feature engineering and feature scaling. In the second phase, the model is built and, after proper evaluation, deployed. This entire life cycle is not a one-person job; the whole team must work together to achieve the required efficiency for the project.
The globally accepted structure for resolving any sort of analytical problem is popularly known as the Cross-Industry Standard Process for Data Mining, abbreviated as the CRISP-DM framework.
Life Cycle of a Typical Data Science Project Explained:
1) Understanding the Business Problem:
To build a successful business model, it is very important to first understand the business problem that the client is facing. Suppose the client wants to predict the customer churn rate of their retail business. You first want to understand the business, its requirements, and what the client actually wants to achieve from the prediction. In such cases, it is important to consult domain experts and understand the underlying problems present in the system. A business analyst is generally responsible for gathering the required details from the client and forwarding them to the data scientist team for further investigation. Even a minute error in defining the problem or understanding the requirements can be critical for the project, so it must be done with maximum precision.
After asking the required questions to the company stakeholders or clients, we move to the next process which is known as data collection.
2) Data Collection
After gaining clarity on the problem statement, we need to collect relevant data to break the problem into small components.
The data science project starts with identifying various data sources, which may include web server logs, social media posts, data from digital libraries such as the US Census datasets, data accessed through internet APIs or web scraping, or information already present in an Excel spreadsheet. Data collection entails obtaining information from both internal and external sources that can help address the business issue.
Normally, the data analyst team is responsible for gathering the data. They need to figure out proper ways to source data and collect the same to get the desired results.
There are two ways to source the data:
- Through web scraping with Python
- Extracting data with the use of third-party APIs
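A sketch of the web-scraping route using only Python's standard library; the page string and its "Price:" markup are hypothetical, and a real scraper would first fetch the HTML over the network:

```python
# Parse a (hypothetical) scraped page with the standard library.
from html.parser import HTMLParser

page = "<html><body><h2>Price: 42</h2><h2>Price: 17</h2></body></html>"

class PriceScraper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_h2, self.prices = False, []

    def handle_starttag(self, tag, attrs):
        self.in_h2 = (tag == "h2")

    def handle_endtag(self, tag):
        self.in_h2 = False

    def handle_data(self, data):
        # Collect the numeric value from every "Price: N" heading.
        if self.in_h2 and data.startswith("Price:"):
            self.prices.append(int(data.split(":")[1]))

scraper = PriceScraper()
scraper.feed(page)
```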
3) Data Preparation
After gathering the data from relevant sources we need to move forward to data preparation. This stage helps us gain a better understanding of the data and prepares it for further evaluation.
This stage is also referred to as data cleaning or data wrangling. It entails selecting relevant data, integrating it by merging data sets, cleaning it, handling missing values by removing them or imputing them with relevant data, handling incorrect data by removing it, and checking for and treating outliers. Feature engineering lets you create new data and extract new features from existing ones. Format the data into the desired structure and remove any unnecessary columns or features. Data preparation is the most time-consuming process, accounting for up to 90% of the total project duration, and it is the most crucial step in the entire life cycle.
Exploratory Data Analysis (EDA) is critical at this point because summarising clean data enables the identification of the data’s structure, outliers, anomalies, and trends. These insights can aid in identifying the optimal set of features, an algorithm to use for model creation, and model construction.
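The EDA step above can be sketched with simple quartile and IQR computations; the values are invented, with one planted anomaly:

```python
# Quick EDA sketch: quartiles and IQR-based outlier flags.
values = sorted([12, 13, 13, 14, 15, 15, 16, 95])   # 95 looks anomalous

def quartile(sorted_vals, q):
    # Nearest-rank style quantile; good enough for a sketch.
    return sorted_vals[int(q * (len(sorted_vals) - 1))]

q1, q3 = quartile(values, 0.25), quartile(values, 0.75)
iqr = q3 - q1
outliers = [v for v in values if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]
```

Flagged points like these are exactly what the cleaning step then removes or investigates.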
4) Data Modeling
In most data analysis work, data modeling is regarded as the core process. In data modeling, we take the prepared data as input and try to produce the desired output from it.
We first select the appropriate type of model to implement, depending on whether the problem is a regression, classification, or clustering problem. Based on the data received, we then choose the machine learning algorithm best suited to the model. Once this is done, we tune the hyperparameters of the chosen model to obtain a favorable outcome.
Finally, we evaluate the model by testing its accuracy and relevance. In addition, we need to make sure there is a correct balance between specificity and generalizability; that is, the created model must be unbiased.
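One concrete form of this tuning is a small grid search over a classification cut-off on validation data; the scores, labels, and grid are invented:

```python
# Tune a classification cut-off on validation data.
scores = [0.1, 0.2, 0.6, 0.8]   # model-predicted probabilities
labels = [0, 0, 1, 1]           # ground-truth classes

def accuracy(threshold):
    preds = [s >= threshold for s in scores]
    return sum(p == bool(y) for p, y in zip(preds, labels)) / len(labels)

# Grid search: evaluate every candidate threshold and keep the best one.
grid = [0.05, 0.3, 0.7, 0.9]
best = max(grid, key=accuracy)
```

The same evaluate-and-select pattern applies to any hyperparameter, not just a probability cut-off.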
5) Model Deployment
Before the model is deployed, we need to ensure that we have picked the right solution after rigorous evaluation. The model is then deployed in the desired channel and format. This is naturally the last step in the life cycle of a data science project. Take extra caution when executing each step to avoid unwanted errors. For example, if you choose the wrong machine learning algorithm for data modeling, you will not achieve the desired accuracy and will find it difficult to get the project approved by the stakeholders. If your data is not cleaned properly, you will have to handle missing values or noise in the dataset later on. Hence, to make sure the model is deployed properly and accepted in the real world as an optimal use case, you will have to test rigorously at every step.
All the steps mentioned above apply equally to beginners and seasoned data science practitioners. As a beginner, your job is to learn the process first, then practice by building and deploying smaller projects such as a fake-news detector or a model on the Titanic dataset. You can refer to portals like analyticsvidhya.com, kaggle.com, and hackerearth.com to find datasets and start working on them.
Luckily for beginners, these portals have already cleaned most of the data, so proceeding with the next steps will be fairly easy. In the real world, however, you have to acquire not just any dataset but data that meets the requirements of your data science project. Initially, your job is to work through all the steps of the data science life cycle diligently; once you are thorough with the process and deployment, you are ready to take the next step toward a career in this field. Python and R are the two languages most widely used in data science use cases.
Nowadays, Julia is also becoming a preferred language for deploying models. Along with a clear grasp of the process, you should be comfortable coding in such languages. From understanding the process to proficiency in a programming language, you need to be adept at all of it.
Data Mining Project Cycle
What is the life cycle of a data mining project? What are the challenging steps? Who should be involved in a data mining project? To answer these questions, let’s go over a typical data mining project step by step.
Step 1: Data Collection The first step of data mining is usually data collection. Business data is stored in many systems across an enterprise. For example, there are hundreds of OLTP databases and over 70 data warehouses inside Microsoft. The first step is to pull the relevant data to a database or a data mart where the data analysis is applied. For instance, if you want to analyze the Web click stream and your company has a dozen Web servers, the first step is to download the Web log data from each Web server.
Sometimes you might be lucky. The data warehouse on the subject of your analysis already exists. However, the data in the data warehouse may not be rich enough. You may still need to gather data from other sources. Suppose that there is a click stream data warehouse containing all the Web clicks on the Web site of your company. You have basic information about customers’ navigation patterns. However, because there is not much demographic information about your Web visitors, you may need to purchase or gather some demographic data from other sources in order to build a more accurate model.
After the data is collected, you can sample the data to reduce the volume of the training dataset. In many cases, the patterns contained in 50,000 customers are the same as in 1 million customers.
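Sampling the collected data down to a workable training set can be as simple as a reproducible random draw. The record layout and sizes below are illustrative:

```python
import random

def sample_training_set(records, n, seed=42):
    """Draw a reproducible simple random sample to shrink the training set."""
    rng = random.Random(seed)                 # fixed seed -> repeatable sample
    return rng.sample(records, min(n, len(records)))

customers = [{"id": i} for i in range(100_000)]
subset = sample_training_set(customers, 5_000)
print(len(subset))
```

Fixing the seed matters in practice: it lets you rerun the later cleaning and modeling steps on exactly the same sample.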
Step 2: Data Cleaning and Transformation Data cleaning and transformation is the most resource-intensive step in a data mining project. The purpose of data cleaning is to remove noise and irrelevant information from the dataset. The purpose of data transformation is to convert the source data into different formats in terms of data types and values. There are various techniques you can apply to data cleaning and transformation, including:
Data type transform: This is the simplest data transform. An example is transforming a Boolean column type to integer. The reason for this transform is that some data mining algorithms perform better on integer data, while others prefer Boolean data.
Continuous column transform: For continuous data such as that in Income and Age columns, a typical transform is to bin the data into buckets. For example, you may want to bin Age into five predefined age groups. Apart from binning, techniques such as normalization are popular for transforming continuous data. Normalization maps all numerical values to a number between 0 and 1 (or –1 to 1) to ensure that large numbers do not dominate smaller numbers during the analysis.
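The two continuous-column transforms just described, binning into predefined groups and min-max normalization, can be sketched as follows. The bin boundaries and labels are illustrative assumptions:

```python
def bin_age(age):
    """Map a continuous age onto one of five predefined age groups."""
    bins = [(0, 18, "minor"), (18, 30, "young adult"), (30, 45, "adult"),
            (45, 65, "middle-aged"), (65, 200, "senior")]
    for lo, hi, label in bins:
        if lo <= age < hi:
            return label
    raise ValueError(f"age out of range: {age}")

def min_max_normalize(values):
    """Scale values into [0, 1] so large numbers don't dominate the analysis."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(bin_age(34))                           # adult
print(min_max_normalize([10, 20, 30, 40]))   # [0.0, ..., 1.0]
```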
Grouping: Sometimes there are too many distinct values (states) for a discrete column. You need to group these values into a few groups to reduce the model’s complexity. For example, the column Profession may have tens of different values such as Software Engineer, Telecom Engineer, Mechanical Engineer, Consultant, and so on. You can group various engineering professions by using a single value: Engineer. Grouping also makes the model easier to interpret.
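In practice the Profession grouping is often just a lookup table. The mapping below uses the values from the example plus a catch-all for anything unseen:

```python
# Collapse many distinct profession values into a few coarse groups.
PROFESSION_GROUPS = {
    "Software Engineer": "Engineer",
    "Telecom Engineer": "Engineer",
    "Mechanical Engineer": "Engineer",
    "Consultant": "Consultant",
}

def group_profession(value):
    """Return the coarse group; unseen values fall into a catch-all bucket."""
    return PROFESSION_GROUPS.get(value, "Other")

print(group_profession("Telecom Engineer"))   # Engineer
print(group_profession("Chef"))               # Other
```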
Aggregation: Aggregation is yet another important transform. Suppose that there is a table containing the telephone call detail records (CDR) for each customer, and your goal is to segment customers based on their monthly phone usage. Since the CDR information is too detailed for the model, you need to aggregate all the calls into a few derived attributes such as total number of calls and the average call duration. These derived attributes can later be used in the model.
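The CDR aggregation amounts to a group-by followed by derived attributes. The record layout here is an assumption for illustration:

```python
from collections import defaultdict

# One call detail record (CDR) per call: (customer_id, duration_in_seconds)
cdrs = [("c1", 60), ("c1", 120), ("c1", 300), ("c2", 30)]

def aggregate_usage(records):
    """Roll detailed CDRs up into per-customer derived attributes."""
    by_customer = defaultdict(list)
    for customer, duration in records:
        by_customer[customer].append(duration)
    return {
        customer: {"total_calls": len(ds), "avg_duration": sum(ds) / len(ds)}
        for customer, ds in by_customer.items()
    }

print(aggregate_usage(cdrs))
```

The resulting per-customer attributes (total calls, average duration) are what the segmentation model would actually consume.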
Missing value handling: Most datasets contain missing values. There are a number of causes for missing data. For instance, you may have two customer tables coming from two OLTP databases. Merging these tables can result in missing values, since table definitions are not exactly the same. In another example, your customer demographic table may have a column for age. But customers don’t always like to give you this information during the registration. You may have a table of daily closing values for the stock MSFT. Because the stock market closes on weekends, there will be null values for those dates in the table. Addressing missing values is an important issue. There are a few ways to deal with this problem. You may replace the missing values with the most popular value (constant). If you don’t know a customer’s age, you can replace it with the average age of all the customers. When a record has too many missing values, you may simply remove it. For more advanced cases, you can build a mining model using those complete cases, and then apply the model to predict the most likely value for each missing case.
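Replacing a missing age with the average of the observed ages, as suggested above, looks like this. The record shape is an assumption:

```python
def impute_age(rows):
    """Fill missing ages with the mean of the observed ages."""
    observed = [r["age"] for r in rows if r["age"] is not None]
    mean_age = sum(observed) / len(observed)
    # Build new records so the originals are left untouched.
    return [dict(r, age=mean_age if r["age"] is None else r["age"])
            for r in rows]

customers = [{"id": 1, "age": 20}, {"id": 2, "age": None}, {"id": 3, "age": 40}]
print(impute_age(customers))
```

The same shape works for the constant-fill strategy; only the computation of the replacement value changes.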
Removing outliers: Outliers are abnormal cases in a dataset. Abnormal cases affect the quality of a model. For example, suppose that you want to build a customer segmentation model based on customer telephone usage (average duration, total number of calls, monthly invoice, international calls, and so on). There are a few customers (0.5%) who behave very differently. Some of these customers live abroad and use roaming all the time. If you include those abnormal cases in the model, you may end up creating a model with the majority of customers in one segment and a few other very small segments containing only these outliers.
The best way to deal with outliers is to simply remove them before the analysis. You can remove outliers based on an individual attribute; for instance, removing the 0.5% of customers with the highest or lowest income. You may also remove outliers based on a set of attributes. In this case, you can use a clustering algorithm. Many clustering algorithms, including Microsoft Clustering, group outliers into a few particular clusters.
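Trimming a fixed fraction off each tail of a single attribute reduces to a sort and a slice. The 1% fraction and income data here are illustrative:

```python
def trim_outliers(values, fraction=0.005):
    """Drop the top and bottom `fraction` of values (e.g. 0.5% per side)."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)
    return ordered[k:len(ordered) - k] if k else ordered

incomes = list(range(1000)) + [10_000_000]   # one extreme income
trimmed = trim_outliers(incomes, fraction=0.01)
print(min(trimmed), max(trimmed))
```

Multi-attribute outlier removal with clustering follows the same idea, except that membership in a small "outlier" cluster replaces the per-attribute percentile cut.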
There are many other data-cleaning and transformation techniques, and there are many tools available in the market. SQL Server Integration Services (SSIS) provides a set of transforms covering most of the tasks listed here.
Step 3: Model Building Once the data is cleaned and the variables are transformed, we can start to build models. Before building any model, we need to understand the goal of the data mining project and the type of the data mining task. Is this project a classification task, an association task or a segmentation task? In this stage, we need to team up with business analysts with domain knowledge. For example, if we mine telecom data, we should team up with marketing people who understand the telecom business.
Model building is the core of data mining, though it is not as time- and resource-intensive as data transformation. Once you understand the type of data mining task, it is relatively easy to pick the right algorithms. For each data mining task, there are a few suitable algorithms. In many cases, you won’t know which algorithm is the best fit for the data before model training. The accuracy of the algorithm depends on the nature of the data such as the number of states of the predictable attribute, the value distribution of each attribute, the relationships among attributes, and so on. For example, if the relationship among all input attributes and predictable attributes were linear, the decision tree algorithm would be a very good choice. If the relationships among attributes are more complicated, then the neural network algorithm should be considered.
The correct approach is to build multiple models using different algorithms and then compare the accuracy of these models using some tool, such as a lift chart, which is described in the next step. Even for the same algorithm, you may need to build multiple models using different parameter settings in order to fine-tune the model’s accuracy.
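Comparing candidate models on the same held-out cases reduces to scoring each one and keeping the winner. The "models" below are plain functions standing in for trained classifiers:

```python
# Candidate "models" are plain prediction functions here; in practice they
# would be classifiers trained with different algorithms or parameter settings.
candidates = {
    "always_true": lambda x: True,
    "threshold_1": lambda x: x >= 1.0,
    "threshold_3": lambda x: x >= 3.0,
}

# Held-out (feature, label) cases, not seen during training.
holdout = [(0.2, False), (0.8, False), (1.5, True), (2.4, True), (3.1, True)]

def accuracy(model, cases):
    return sum(model(x) == label for x, label in cases) / len(cases)

scores = {name: accuracy(m, holdout) for name, m in candidates.items()}
best = max(scores, key=scores.get)
print(scores, "->", best)
```

A lift chart adds more nuance than a single accuracy number, but the workflow of scoring every candidate on shared data is the same.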
Step 4: Model Assessment In the model-building stage, we build a set of models using different algorithms and parameter settings. So what is the best model in terms of accuracy? How do you evaluate these models? There are a few popular tools to evaluate the quality of a model. The most well-known one is the lift chart. It uses a trained model to predict the values of the testing dataset. Based on the predicted value and probability, it graphically displays the model in a chart.
In the model assessment stage, not only do you use tools to evaluate the model accuracy but you also need to discuss the meaning of discovered patterns with business analysts. For example, if you build an association model on a dataset, you may find rules such as Relationship = Husband => Gender = Male with 100% confidence. Although the rule is valid, it doesn’t contain any business value. It is very important to work with business analysts who have the proper domain knowledge in order to validate the discoveries. Sometimes the model doesn’t contain useful patterns. This may occur for a couple of reasons. One is that the data is completely random. While it is possible to have random data, in most cases, real datasets do contain rich information. The second reason, which is more likely, is that the set of variables in the model is not the best one to use. You may need to repeat the data-cleaning and transformation step in order to derive more meaningful variables. Data mining is a cyclic process; it usually takes a few iterations to find the right model.
Step 5: Reporting Reporting is an important delivery channel for data mining findings. In many organizations, the goal of data miners is to deliver reports to the marketing executives. Most data mining tools have reporting features that allow users to generate predefined reports from mining models with textual or graphic outputs. There are two types of reports: reports about the findings (patterns) and reports about the prediction or forecast.
Step 6: Prediction (Scoring) In many data mining projects, finding patterns is just half of the work; the final goal is to use these models for prediction. Prediction is also called scoring in data mining terminology. To give predictions, we need to have a trained model and a set of new cases. Consider a banking scenario in which you have built a model about loan risk evaluation. Every day there are thousands of new loan applications. You can use the risk evaluation model to predict the potential risk for each of these loan applications.
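Scoring then means applying the trained model to each new case. A minimal sketch of the loan scenario follows; the risk rule and field names are invented for illustration:

```python
def score(model, applications):
    """Apply a trained risk model to a batch of new loan applications."""
    return {app["id"]: model(app) for app in applications}

# Hypothetical "trained" model: flag high loan-to-income ratios as risky.
def risk_model(app):
    return "high" if app["loan"] / app["income"] > 5 else "low"

new_applications = [
    {"id": "A1", "income": 40_000, "loan": 300_000},
    {"id": "A2", "income": 80_000, "loan": 200_000},
]
print(score(risk_model, new_applications))
```

In production the batch would arrive daily, and the same scoring call would run against thousands of applications.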
Step 7: Application Integration Embedding data mining into business applications is about applying intelligence back to the business, that is, closing the analysis loop. According to Gartner Research, in the next few years, more and more business applications will embed a data mining component as a value-added feature. For example, CRM applications may have data mining features that group customers into segments. ERP applications may have data mining features to forecast production. An online bookstore can give customers real-time recommendations on books. Integrating data mining features, especially a real-time prediction component, into applications is one of the important steps of a data mining project. It is the key step for bringing data mining into mass usage.
Step 8: Model Management It is challenging to maintain the status of mining models. Each mining model has a life cycle. In some businesses, patterns are relatively stable and models don’t require frequent retraining. But in many businesses, patterns vary frequently. For example, in online bookstores, new books appear every day, which means new association rules appear every day. The duration of a mining model is limited; a new version of the model must be created frequently. Ultimately, determining the model’s accuracy and creating new versions of the model should be accomplished by using automated processes. Like any data, mining models also have security issues. Mining models contain patterns, and many of these patterns are summaries of sensitive data. We need to maintain the read, write, and prediction rights for different user profiles. Mining models should be treated as first-class citizens in a database, where administrators can assign and revoke user access rights to these models.
- Making a great Resume: Get the basics right
- Resume Tips, Resume Advice
- How to get right job with right resume?
- How to design your resume?
- Have you ever lie on your resume? Read This
- Tips for writing resume in slowdown
- What do employers look for in a resume?
- 21 Resume tips for a killer resume
- Resume tips for techies
- 5 ways to be authentic in an interview
- Tips to help you face your job interview
- Top 10 commonly asked BPO Interview questions
- 5 things you should never talk in any job interview
- 2018 Best job interview tips for job seekers
- 7 Tips to recruit the right candidates in 2018
- 5 Important interview questions techies fumble most
- What are avoidable questions in an Interview?
- Top 4 tips to help you get hired as a receptionist
- 8 things ever to say in a job interview
- 5 Tips to Overcome Fumble During an Interview
- How to Overcome Pre-interview Jitters
- What Not to Do in a Job Interview?
- 8 Mock Interview Questions for Freshers
- How to face Telephone Interview?
- The impact of GST on job creation
- How Can Freshers Keep Their Job Search Going?
- How to Convert Your Internship into a Full Time Job?
- 5 Top Career Tips to Get Ready for a Virtual Job Fair
- Smart tips to succeed in virtual job fairs
- Why Email Marketing?
- Top 10 facts why you need a cover letter?
- 6 things to remember for Eid celebrations
- 9 ways to get succeed in job search
- 5 ways to turn your internship in a job
- 7 job search tips during Ramadan
- Top 5 GCC jobs of the future
- Most popular women in Tech History
- Blind Hiring: 2018 Recruitment trend
- 3 Golden rules to optimize your job search
- Union Budget 2018 Highlights
- Online hiring saw 14% rise in November: Report
- Hiring Activities Saw Growth in March: Report
- Attrition rate dips in corporate India: Survey
- 2016 Most Productive year for Staffing: Study
- The impact of Demonetization across sectors
- Most important skills required to get hired
- How startups are innovating with interview formats
- Does chemistry workout in job interviews?
- 15 signs your job interview is going horribly
- Overview of IT/ITes sector
- Time to Expand NBFCs: Rise in Demand for Talent
- Here's how to train middle managers
- This is how banks are wooing startups
- Nokia to cut thousands of jobs
- Our Portals :
- Canada Jobs
- South Africa Jobs
- Malaysia Jobs
- Singapore Jobs
- Australia Jobs
- New Zealand Jobs
Wisdomjobs.com is one of the best job search sites in India.
- 3,24,69,003 Resumes Uploaded
- 16,70,393 Jobs Available
- 1,32,30,521 Assessments taken
Data Mining Project Cycle
What is the life cycle of a data mining project? What are the challenging steps?
Who should be involved in a data mining project? To answer these questions, let’s go over a typical data mining project step by step.
Step 1: Data Collection
The first step of data mining is usually data collection. Business data is stored in many systems across an enterprise. For example, there are hundreds of OLTP databases and over 70 data warehouses inside Microsoft. The first step is to pull the relevant data to a database or a data mart where the data analysis is applied. For instance, if you want to analyze the Web click stream and your company has a dozen Web servers, the first step is to download the Web log data from each Web server.
Sometimes you might be lucky: the data warehouse on the subject of your analysis already exists. However, the data in the data warehouse may not be rich enough, and you may still need to gather data from other sources. Suppose that there is a click stream data warehouse containing all the Web clicks on your company's Web site. You have basic information about customers' navigation patterns. However, because there is not much demographic information about your Web visitors, you may need to purchase or gather some demographic data from other sources in order to build a more accurate model.
After the data is collected, you can sample the data to reduce the volume of the training dataset. In many cases, the patterns contained in 50,000 customers are the same as in 1 million customers.
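As a sketch of that sampling idea in plain Python, with a hypothetical list of customer IDs standing in for the real tables:

```python
import random

def sample_cases(cases, n, seed=42):
    """Draw a simple random sample of n cases from the full dataset."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(cases, n)

# 1,000,000 hypothetical customer records reduced to a 50,000-case training set
customers = list(range(1_000_000))
training = sample_cases(customers, 50_000)
print(len(training))  # 50000
```

In practice the sampling would run against the database or data mart, but the principle is the same: a random subset often carries the same patterns as the full population.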
Step 2: Data Cleaning and Transformation
Data cleaning and transformation is the most resource-intensive step in a data mining project. The purpose of data cleaning is to remove noise and irrelevant information from the dataset. The purpose of data transformation is to modify the source data into different formats in terms of data types and values.
There are various techniques you can apply to data cleaning and transformation, including:
Data type transform: This is the simplest data transform. An example is transforming a Boolean column type to integer. The reason for this transform is that some data mining algorithms perform better on integer data, while others prefer Boolean data.
Continuous column transform: For continuous data such as that in Income and Age columns, a typical transform is to bin the data into buckets. For example, you may want to bin Age into five predefined age groups. Apart from binning, techniques such as normalization are popular for transforming continuous data. Normalization maps all numerical values to a number between 0 and 1 (or –1 and 1) to ensure that large numbers do not dominate smaller numbers during the analysis.
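Binning and min-max normalization can be sketched in a few lines of Python; the five age groups and the income figures below are illustrative, not taken from the text:

```python
def bin_age(age):
    """Map a raw age to one of five predefined age groups (illustrative cut points)."""
    bins = [(0, 18, "0-17"), (18, 30, "18-29"), (30, 45, "30-44"),
            (45, 60, "45-59"), (60, 200, "60+")]
    for low, high, label in bins:
        if low <= age < high:
            return label
    return "unknown"

def normalize(values):
    """Min-max normalization: map numeric values into the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(bin_age(37))                           # 30-44
print(normalize([20_000, 60_000, 100_000]))  # [0.0, 0.5, 1.0]
```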
Grouping: Sometimes there are too many distinct values (states) for a discrete column. You need to group these values into a few groups to reduce the model's complexity. For example, the column Profession may have tens of different values such as Software Engineer, Telecom Engineer, Mechanical Engineer, Consultant, and so on. You can group the various engineering professions under a single value: Engineer. Grouping also makes the model easier to interpret.
Aggregation: Aggregation is yet another important transform. Suppose that there is a table containing the telephone call detail records (CDR) for each customer, and your goal is to segment customers based on their monthly phone usage. Since the CDR information is too detailed for the model, you need to aggregate all the calls into a few derived attributes such as total number of calls and the average call duration. These derived attributes can later be used in the model.
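A minimal Python sketch of this aggregation, using a hypothetical list of call detail records:

```python
from collections import defaultdict

# Hypothetical call detail records: (customer_id, call_duration_seconds)
cdr = [("alice", 120), ("alice", 300), ("bob", 60),
       ("alice", 180), ("bob", 240)]

# Roll the detailed calls up into per-customer totals
totals = defaultdict(lambda: {"calls": 0, "duration": 0})
for customer, duration in cdr:
    totals[customer]["calls"] += 1
    totals[customer]["duration"] += duration

# Derived attributes the segmentation model can actually use
derived = {c: {"total_calls": t["calls"],
               "avg_duration": t["duration"] / t["calls"]}
           for c, t in totals.items()}
print(derived["alice"])  # {'total_calls': 3, 'avg_duration': 200.0}
```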
Missing value handling: Most datasets contain missing values. There are a number of causes for missing data. For instance, you may have two customer tables coming from two OLTP databases. Merging these tables can result in missing values, since table definitions are not exactly the same. In another example, your customer demographic table may have a column for age. But customers don't always like to give you this information during registration. You may have a table of daily closing values for the stock MSFT. Because the stock market closes on weekends, there will be null values for those dates in the table. Addressing missing values is an important issue. There are a few ways to deal with this problem. You may replace the missing values with the most popular value (constant). If you don't know a customer's age, you can replace it with the average age of all the customers. When a record has too many missing values, you may simply remove it. For more advanced cases, you can build a mining model using those complete cases, and then apply the model to predict the most likely value for each missing case.
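For example, replacing missing ages with the average of the known ages might look like this (toy data, with None standing in for the nulls):

```python
ages = [34, None, 45, 29, None, 52]

# Compute the mean over the known values only
known = [a for a in ages if a is not None]
mean_age = sum(known) / len(known)  # (34 + 45 + 29 + 52) / 4 = 40.0

# Impute: keep known values, substitute the mean for the missing ones
filled = [a if a is not None else mean_age for a in ages]
print(filled)  # [34, 40.0, 45, 29, 40.0, 52]
```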
Removing outliers: Outliers are abnormal cases in a dataset, and abnormal cases affect the quality of a model. For example, suppose that you want to build a customer segmentation model based on customer telephone usage (average duration, total number of calls, monthly invoice, international calls, and so on). There are a few customers (0.5%) who behave very differently; some of these customers live abroad and use roaming all the time. If you include those abnormal cases in the model, you may end up creating a model with the majority of customers in one segment and a few other very small segments containing only these outliers.
The best way to deal with outliers is to simply remove them before the analysis. You can remove outliers based on an individual attribute; for instance, removing the 0.5% of customers with the highest or lowest income. You may also remove outliers based on a set of attributes. In this case, you can use a clustering algorithm. Many clustering algorithms, including Microsoft Clustering, group outliers into a few particular clusters.
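Trimming by an individual attribute can be sketched as follows; the 0.5% fraction matches the example above, while the income records are hypothetical:

```python
def trim_outliers(records, key, fraction=0.005):
    """Drop the top and bottom `fraction` of records ranked by `key`."""
    ordered = sorted(records, key=key)
    k = int(len(ordered) * fraction)  # number of records to cut at each tail
    return ordered[k: len(ordered) - k] if k else ordered

# 1,000 hypothetical customers with synthetic incomes
incomes = [{"id": i, "income": i * 100} for i in range(1000)]
kept = trim_outliers(incomes, key=lambda r: r["income"])
print(len(kept))  # 990: the 5 lowest and 5 highest earners are removed
```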
There are many other data-cleaning and transformation techniques, and there are many tools available in the market. SQL Server Integration Services (SSIS) provides a set of transforms covering most of the tasks listed here.
Step 3: Model Building
Once the data is cleaned and the variables are transformed, we can start to build models. Before building any model, we need to understand the goal of the data mining project and the type of the data mining task. Is this project a classification task, an association task or a segmentation task? In this stage, we need to team up with business analysts with domain knowledge. For example, if we mine telecom data, we should team up with marketing people who understand the telecom business.
Model building is the core of data mining, though it is not as time- and resource-intensive as data transformation. Once you understand the type of data mining task, it is relatively easy to pick the right algorithms. For each data mining task, there are a few suitable algorithms. In many cases, you won't know which algorithm is the best fit for the data before model training. The accuracy of the algorithm depends on the nature of the data, such as the number of states of the predictable attribute, the value distribution of each attribute, the relationships among attributes, and so on. For example, if the relationship among all input attributes and predictable attributes were linear, the decision tree algorithm would be a very good choice. If the relationships among attributes are more complicated, then the neural network algorithm should be considered.
The correct approach is to build multiple models using different algorithms and then compare the accuracy of these models using some tool, such as a lift chart, which is described in the next step. Even for the same algorithm, you may need to build multiple models using different parameter settings in order to fine-tune the model’s accuracy.
Step 4: Model Assessment
In the model-building stage, we build a set of models using different algorithms and parameter settings. So what is the best model in terms of accuracy?
How do you evaluate these models? There are a few popular tools to evaluate the quality of a model. The most well-known one is the lift chart. It uses a trained model to predict the values of the testing dataset. Based on the predicted value and probability, it graphically displays the model in a chart. We will give a better description of lift charts in Chapter 3.
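The core computation behind a lift (cumulative gains) chart can be sketched as follows; the predicted probabilities and actual labels below are hypothetical test cases:

```python
# Each test case: (predicted probability of the positive class, actual label)
scored = [(0.95, 1), (0.90, 1), (0.80, 0), (0.70, 1),
          (0.60, 0), (0.40, 1), (0.30, 0), (0.10, 0)]

# Rank cases by descending predicted probability, then accumulate actual
# positives. The faster this curve rises compared with the diagonal of a
# random guesser, the better the model's lift.
ranked = sorted(scored, key=lambda s: s[0], reverse=True)
total_pos = sum(label for _, label in ranked)

cumulative, gains = 0, []
for _, label in ranked:
    cumulative += label
    gains.append(cumulative / total_pos)
print(gains)  # [0.25, 0.5, 0.5, 0.75, 0.75, 1.0, 1.0, 1.0]
```

A charting tool plots these cumulative fractions against the fraction of cases contacted; here the top 25% of ranked cases already capture 50% of the positives.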
In the model assessment stage, not only do you use tools to evaluate the model accuracy, but you also need to discuss the meaning of discovered patterns with business analysts. For example, if you build an association model on a dataset, you may find rules such as Relationship = Husband => Gender = Male with 100% confidence. Although the rule is valid, it doesn't contain any business value. It is very important to work with business analysts who have the proper domain knowledge in order to validate the discoveries.
Sometimes the model doesn't contain useful patterns. This may occur for a couple of reasons. One is that the data is completely random. While it is possible to have random data, in most cases, real datasets do contain rich information. The second reason, which is more likely, is that the set of variables in the model is not the best one to use. You may need to repeat the data-cleaning and transformation step in order to derive more meaningful variables. Data mining is a cyclic process; it usually takes a few iterations to find the right model.
Step 5: Reporting
Reporting is an important delivery channel for data mining findings. In many organizations, the goal of data miners is to deliver reports to the marketing executives. Most data mining tools have reporting features that allow users to generate predefined reports from mining models with textual or graphic outputs. There are two types of reports: reports about the findings (patterns) and reports about the prediction or forecast.
Step 6: Prediction (Scoring)
In many data mining projects, finding patterns is just half of the work; the final goal is to use these models for prediction. Prediction is also called scoring in data mining terminology. To give predictions, we need to have a trained model and a set of new cases. Consider a banking scenario in which you have built a model about loan risk evaluation. Every day there are thousands of new loan applications. You can use the risk evaluation model to predict the potential risk for each of these loan applications.
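A sketch of batch scoring in that banking scenario; the linear risk formula here is a toy stand-in for a real trained model, and all attribute names, weights, and the 0.5 threshold are hypothetical:

```python
def score_application(app, threshold=0.5):
    """Apply a (hypothetical) trained risk model to one loan application."""
    # Stand-in for a real model: a toy linear score over two attributes.
    risk = 0.7 * app["debt_ratio"] + 0.3 * (1 - app["credit_score"] / 850)
    return {"id": app["id"], "risk": round(risk, 3),
            "decision": "review" if risk >= threshold else "approve"}

# Each day's new cases are scored against the trained model
new_applications = [
    {"id": 1, "debt_ratio": 0.2, "credit_score": 780},
    {"id": 2, "debt_ratio": 0.9, "credit_score": 450},
]
for app in new_applications:
    print(score_application(app))
```

The essential shape is the same whatever the model: a trained artifact plus a stream of new cases in, a predicted value (and often a probability) out.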
Step 7: Application Integration
Embedding data mining into business applications is about applying intelligence back to the business, that is, closing the analysis loop. According to Gartner Research, in the next few years, more and more business applications will embed a data mining component as a value-added feature. For example, CRM applications may have data mining features that group customers into segments.
ERP applications may have data mining features to forecast production. An online bookstore can give customers real-time recommendations on books.
Integrating data mining features, especially a real-time prediction component, into applications is one of the important steps of a data mining project. This is the key step for bringing data mining into mass usage.
Step 8: Model Management
It is challenging to maintain the status of mining models. Each mining model has a life cycle. In some businesses, patterns are relatively stable and models don't require frequent retraining. But in many businesses, patterns vary frequently. For example, in online bookstores, new books appear every day, which means that new association rules appear every day. The duration of a mining model is limited, and a new version of the model must be created frequently. Ultimately, determining the model's accuracy and creating new versions of the model should be accomplished by using automated processes.
Like any data, mining models also have security issues. Mining models contain patterns, and many of these patterns are summaries of sensitive data. We need to maintain the read, write, and prediction rights for different user profiles. Mining models should be treated as first-class citizens in a database, where administrators can assign and revoke user access rights to these models.
Traditional Data Mining Life Cycle (Crisp Methodology)
- Last Updated : 29 Aug, 2022
Prerequisite – Data Mining. The data life cycle is the sequence of stages that a particular unit of data goes through, from its initial generation or capture to its eventual archival and/or deletion at the end of its useful life. This cycle has surface similarities with the more conventional data mining cycle as described in the CRISP methodology. Steps of the traditional data mining life cycle:
- Business Understanding: This initial phase focuses on understanding the project objectives and requirements from a business perspective, and then converting this knowledge into a data mining problem definition. A preliminary plan is designed to achieve the objectives. A decision model, particularly one built using the Decision Model and Notation standard, can be used.
- Data Understanding: The data understanding phase starts with an initial data collection and proceeds with activities to become familiar with the data, identify data quality problems, discover first insights into the data, or detect interesting subsets to form hypotheses about hidden information.
- Data Preparation: Data preparation tasks are likely to be performed multiple times, and not in any prescribed order. Tasks include table, record, and attribute selection as well as transformation and cleaning of data for modeling tools.
- Modeling: In this phase, various modeling techniques are selected and applied, and their parameters are calibrated to optimal values. Typically, there are several techniques for the same data mining problem type. Some techniques have specific requirements on the form of the data, so it is often necessary to step back to the data preparation phase.
- Evaluation: At this stage in the project, you have built a model (or models) that appears to have high quality from a data analysis perspective. Before proceeding to final deployment of the model, it is important to evaluate the model thoroughly and review the steps executed to build it, to be certain it properly achieves the business objectives. A key objective is to determine whether there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached.
- Deployment: Creation of the model is generally not the end of the project. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will have to be organized and presented in a way that is useful to the customer. Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data scoring (e.g., segment assignment) or data mining process.
The Team Data Science Process lifecycle
The Team Data Science Process (TDSP) provides a recommended lifecycle that you can use to structure your data-science projects. The lifecycle outlines the complete steps that successful projects follow. If you use another data-science lifecycle, such as the Cross Industry Standard Process for Data Mining (CRISP-DM) , Knowledge Discovery in Databases (KDD) , or your organization's own custom process, you can still use the task-based TDSP.
This lifecycle is designed for data-science projects that are intended to ship as part of intelligent applications. These applications deploy machine learning or artificial intelligence models for predictive analytics. Exploratory data-science projects and improvised analytics projects can also benefit from the use of this process. But for those projects, some of the steps described here might not be needed.
Five lifecycle stages
The TDSP lifecycle is composed of five major stages that are executed iteratively. These stages are:
- Business understanding
- Data acquisition and understanding
- Modeling
- Deployment
- Customer acceptance
Here is a visual representation of the TDSP lifecycle:
The TDSP lifecycle is modeled as a sequence of iterated steps that provide guidance on the tasks needed to use predictive models. You deploy the predictive models in the production environment that you plan to use to build the intelligent applications. The goal of this process lifecycle is to continue to move a data-science project toward a clear engagement end point. Data science is an exercise in research and discovery. The ability to communicate tasks to your team and your customers by using a well-defined set of artifacts that employ standardized templates helps to avoid misunderstandings. Using these templates also increases the chance of the successful completion of a complex data-science project.
For each stage, we provide the following information:
- Goals : The specific objectives.
- How to do it : An outline of the specific tasks and guidance on how to complete them.
- Artifacts : The deliverables and the support to produce them.
For examples of how to execute steps in TDSPs that use Azure Machine Learning, see Use the TDSP with Azure Machine Learning .
This article is maintained by Microsoft. It was originally written by the following contributors.
- Mark Tabladillo | Senior Cloud Solution Architect
Towards Data Science
Feb 17, 2020
CRISP-DM methodology leader in data mining and big data
A short step-by-step guide to the machine learning methodology.
In March 2015, I collaborated on a paper, called “Methodological Business proposals for the Development of Big Data Projects” , together with Alberto Cavadia, and Juan Gómez. Back then, we realized that big data projects usually have 7 parts.
Shortly after, I used the CRISP-DM methodology for my thesis because it was an open standard, widely used in industry, and (thanks to the previous paper) I knew it was quite similar to other approaches.
As my professional career in the data layer develops, I can't help noticing that the CRISP-DM methodology remains quite relevant. In fact, data management units and IT profiles are built around the steps of this methodology. So I decided to dedicate a short story to describing the steps of this long-winning methodology.
CRISP-DM stands for Cross Industry Standard Process for Data Mining and is a 1996 methodology created to shape Data Mining projects. It consists of 6 steps to conceive a Data Mining project, and the steps can be iterated in cycles according to developers' needs. Those steps are Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment.
The first step is Business Understanding and its objective is to give context to the goals and to the data so that the developer/engineer gets a notion of the relevance of data in that particular business model.
It involves in-person meetings, online meetings, documentation reading, domain-specific learning, and a long list of other activities that help the development team ask questions about the relevant context.
The product of this step is that the development team understands the context of the project. The goals of the project should be defined before the project starts. For example, the development team should know by now that the objective is to increase sales, and after the step is over, understand what the client sells and how they sell it.
The second step is Data Understanding and its objective is to know what can be expected and achieved from the data. It checks the quality of the data in several respects, such as data completeness, value distributions, and data governance compliance.
This is a crucial part of the project because it defines how viable and trustworthy the final results can be. In this step, team members brainstorm on how to extract the best value from the pieces of information. If the use or relevance of some piece of data is unclear to the development team, they can momentarily step back to understand the business and how it benefits from that piece of information.
Thanks to this step, data scientists now know how, in terms of data, the result should satisfy the goals of the project, which algorithms and processes bring that result, what the current state of the data is, and what it should be in order to be useful to the algorithms and processes involved.
The third step is Data Preparation. It involves the ETL (or ELT) processes that turn the raw data into something useful to the algorithms.
Sometimes data governance policies are not respected, or not even set, in an organization; to give true meaning to the data, it then becomes the data engineers' and data scientists' job to standardize the information.
Likewise, some algorithms perform better under certain conditions: some do not accept non-numerical values, and others do not cope well with large variance in values. Again, it is up to the development team to normalize the information.
Most projects spend the majority of their time on this step. This step, I believe, is the reason there is an IT profile called data engineer: the work is time-consuming and can get really complex with large amounts of data, so IT departments find it advantageous to dedicate resources specifically to these duties.
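As a minimal sketch of this kind of preparation work, the snippet below min-max scales a numeric field and one-hot encodes a non-numerical one. The column names and sample rows are invented for illustration; real projects would do this inside their ETL tooling.

```python
# Minimal data-preparation sketch using only the standard library.
# The "rows" data and field names are hypothetical examples.

rows = [
    {"amount": 10.0, "channel": "web"},
    {"amount": 50.0, "channel": "store"},
    {"amount": 30.0, "channel": "web"},
]

def prepare(rows):
    amounts = [r["amount"] for r in rows]
    lo, hi = min(amounts), max(amounts)
    channels = sorted({r["channel"] for r in rows})
    prepared = []
    for r in rows:
        # Min-max scale the numeric value into [0, 1].
        features = {"amount_scaled": (r["amount"] - lo) / (hi - lo)}
        # One-hot encode the non-numerical value so any algorithm can use it.
        for c in channels:
            features[f"channel_{c}"] = 1.0 if r["channel"] == c else 0.0
        prepared.append(features)
    return prepared

print(prepare(rows)[0])  # prints {'amount_scaled': 0.0, 'channel_store': 0.0, 'channel_web': 1.0}
```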
The fourth step is Modeling, the core of any machine learning project. This step is responsible for producing the results that should satisfy, or help satisfy, the project goals.
Although it is the glamorous part of the project, it is also the shortest in time, because if everything before it was done correctly there is little left to adjust. If the results can be improved, the methodology allows stepping back to Data Preparation to improve the available data.
Algorithms such as k-means, hierarchical clustering, time series models, linear regression, and k-nearest neighbors, among many others, make up the core code of this step.
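To give a flavor of this step, one of the algorithms named above, linear regression, can be fit with a closed-form least-squares formula in a few lines. The data points below are made up for illustration:

```python
# Fit y = a*x + b by ordinary least squares, using only the standard library.
# The data points are hypothetical.

def fit_line(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # Slope: covariance(x, y) divided by variance(x).
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.1, 7.9]   # roughly y = 2x
a, b = fit_line(xs, ys)
print(f"slope={a:.2f} intercept={b:.2f}")  # prints slope=1.96 intercept=0.10
```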
The fifth step is Evaluation, where the task is to verify that the results are valid and correct. If the results are wrong, the methodology permits going back as far as the first step to understand why.
Usually, on a data science project, the data scientists divide the data into training and testing sets. In this step the testing data is used; the objective is to verify that the model (the product of the Modeling step) is faithful to reality.
Depending on the task and the context, there are diverse evaluation techniques. For example, in supervised learning, for a classification task, one way to verify the results is the confusion matrix. In unsupervised learning, evaluation becomes harder because there is no fixed value separating "correct" from "incorrect"; a clustering task, for example, would be evaluated by calculating the inter- and intra-cluster distances between elements.
In any case, it is important to specify some kind of error measure. This error measure tells users how much confidence they can place in the results, whether "this will surely work" or "this surely won't." If the error measure happens to be zero for all cases, it would indicate that the model is overfit and might perform differently on real data.
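For the supervised classification case, a confusion matrix is simple to compute from the held-out testing data. The labels below are hypothetical test-set results:

```python
from collections import Counter

# Build a confusion matrix from true vs. predicted labels (hypothetical data).
y_true = ["spam", "spam", "ham", "ham", "spam", "ham"]
y_pred = ["spam", "ham",  "ham", "ham", "spam", "spam"]

# Each (true, predicted) pair becomes a cell count of the matrix.
matrix = Counter(zip(y_true, y_pred))

# Diagonal cells (true == predicted) are the correct classifications.
accuracy = sum(n for (t, p), n in matrix.items() if t == p) / len(y_true)
print(dict(matrix))
print(f"accuracy={accuracy:.2f}")  # prints accuracy=0.67
```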
The sixth and last step is Deployment. It consists of presenting the results in a useful and understandable manner; by achieving this, the project should achieve its goals. It is the only step that does not belong to a cycle.
What counts as useful and understandable varies with the final user. For example, if the final user is another piece of software, as when a sales website asks its recommendation system what to suggest to a buyer, a useful form would be a JSON document carrying the response to a specific query. In another case, such as a top executive who requires projected information for decision making, the best way to present the findings is to store them in an analytical database and expose them as a dashboard in a business intelligence solution.
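For the software-to-software case, the deployment payload could be as simple as the following; the field names and values are invented for illustration:

```python
import json

# Hypothetical recommendation-system response for a specific buyer query.
response = {
    "query": {"buyer_id": 1234},
    "recommendations": [
        {"product_id": 42, "score": 0.93},
        {"product_id": 7,  "score": 0.88},
    ],
}

# Serialize to JSON so the calling website can consume it directly.
payload = json.dumps(response)
print(payload)
```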
I decided to write this short explanation because I am surprised by the methodology's lasting relevance. It has been around for a long time, and it seems likely to prevail for longer.
The methodology is quite logical and straightforward in its steps. Because it covers every aspect of a data mining project and allows cycles in its execution, it is robust and earns trust. It is no surprise that most developers and project managers choose it, and that the alternative methodologies are quite similar.
I hope this short introduction helps IT professionals argue for the methodological development of their tasks. Professionals from other areas of informatics can also read this story to get a basic understanding of what data scientists do, and how their work relates to other profiles such as data engineering and business intelligence.
I hope you have enjoyed it, as this is my first story :).
The Cross-Industry Standard Process for Data Mining ( CRISP-DM ) is the dominant data-mining process framework. It's an open standard; anyone may use it. The following list describes the various phases of the process.
Business understanding: Get a clear understanding of the problem you're out to solve, how it impacts your organization, and your goals for addressing it. Tasks in this phase include:
Identifying your business goals
Assessing your situation
Defining your data mining goals
Producing your project plan
Data understanding: Review the data that you have, document it, and identify data management and data quality issues.
Data preparation: Get your data ready to use for modeling.
Modeling: Use mathematical techniques to identify patterns within your data.
Evaluation: Review the patterns you have discovered and assess their potential for business use. Tasks for this phase include:
Reviewing the process
Determining the next steps
Deployment: Put your discoveries to work in everyday business. Tasks for this phase include:
Planning deployment (your methods for integrating data mining discoveries into use)
Reporting final results
Reviewing final results
About This Article
This article is from the book Data Mining For Dummies.
About the book author:
Meta S. Brown helps organizations use practical data analysis to solve everyday business problems. A hands-on data miner who has tackled projects with up to $900 million at stake, she is a recognized expert in cutting-edge business analytics.
Data Science Process Alliance
What is SEMMA?
The SAS Institute developed SEMMA as its process for data mining. It has five steps (Sample, Explore, Modify, Model, and Assess), which give it the acronym SEMMA. You can use the SEMMA data mining methodology to solve a wide range of business problems, including fraud identification, customer retention and churn, database marketing, customer loyalty, bankruptcy forecasting, market segmentation, and risk, affinity, and portfolio analysis.
Businesses use the SEMMA methodology on their data mining and machine learning projects to achieve a competitive advantage, improve performance, and deliver more useful services to customers. The data we collect about our surroundings serves as the foundation for hypotheses and models of the world we live in.
Ultimately, data is accumulated to build knowledge, which means data is not worth much until it is studied and analyzed. Hoarding vast volumes of data is not equivalent to gathering valuable knowledge; only when data is sorted and evaluated do we learn anything from it.
Thus, SEMMA is designed as a data science methodology to help practitioners convert data into knowledge.
The 5 Stages Of SEMMA
SAS positions SEMMA as an organized, functional toolset associated with its SAS Enterprise Miner initiative. While the SEMMA process is less transparent to those not using that tool, most regard it as a general data mining methodology rather than a feature of a specific product.
The process breaks down into its own set of stages. These include:
- Sample: This step entails choosing, from the vast dataset available, a subset of appropriate volume for the model's construction. The goal of this initial stage is to identify the variables or factors (both dependent and independent) influencing the process. The collected information is then sorted into preparation and validation sets.
- Explore: During this step, univariate and multivariate analyses are conducted to study the interconnected relationships between data elements and to identify gaps in the data. While the multivariate analysis studies the relationships between variables, the univariate analysis looks at each factor individually to understand its part in the overall scheme. All of the factors that may influence the study's outcome are analyzed, with heavy reliance on data visualization.
- Modify: In this step, business logic is applied to the lessons learned while exploring the data collected in the sample phase. In other words, the data is parsed and cleaned, then passed on to the modeling stage; if it still requires refinement and transformation, it is explored again.
- Model: With the variables refined and the data cleaned, the modeling step applies a variety of data mining techniques to produce a projected model of how this data achieves the final, desired outcome of the process.
- Assess: In this final SEMMA stage, the model is evaluated for how useful and reliable it is for the studied topic. The data can now be tested and used to estimate the efficacy of the model's performance.
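The Sample stage's split into preparation and validation sets can be sketched as follows; the dataset, sample size, split ratio, and seed are all arbitrary assumptions for illustration:

```python
import random

# Hypothetical dataset: draw a reproducible sample, then split it into
# preparation (training) and validation subsets, as in SEMMA's Sample stage.
data = list(range(100))

rng = random.Random(42)        # fixed seed so the split is reproducible
sample = rng.sample(data, 20)  # subset of appropriate volume, no duplicates

cut = int(len(sample) * 0.8)   # 80/20 preparation/validation split
preparation, validation = sample[:cut], sample[cut:]
print(len(preparation), len(validation))  # prints 16 4
```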
How Popular is SEMMA?
In four polls spanning 2002 to 2014 on KDnuggets.com, respondents selected SEMMA 7–13% of the time. While significantly behind CRISP-DM, this makes SEMMA the second most commonly selected pre-defined framework.
We conducted a similar poll on this site in 2020. SEMMA was only selected by a single person. This is not a true comparison to KDnuggets’ polls as our audience likely has different demographics and our result options and question were different.
However, anecdotally, we don’t encounter many practitioners who have even heard of SEMMA. And given its myopic focus (as discussed in the next section), SEMMA likely has fallen out of favor with more modern and comprehensive data science methodologies .
SEMMA vs KDD Process vs CRISP-DM
The Cross-Industry Standard Process for Data Mining (CRISP-DM) and the Knowledge Discovery in Databases (KDD) process are two similar data mining life cycles.
Comparing KDD and SEMMA at a high level, the parallels draw themselves. The Sample stage is roughly comparable to KDD's Selection, and the Pre-processing and Explore phases achieve the same basic function in their respective processes.
The Modify stage, much like its KDD equivalent Transformation, is responsible for refining the data sorted in the stage before it. The Model phase is a loose equivalent of Data Mining (as defined by KDD), in the sense that it is when the collected, selected, and refined data is brought together through various tests to derive knowledge and illustrate it more visually. Finally, the Assess step of SEMMA is a near-direct equivalent of KDD's Evaluation phase, where the modeling results are tested for their efficacy and previously unknown findings are funneled back to refine the cyclical process.
SEMMA is a rather myopic approach toward data science projects. It does its job at explaining the core technical steps of a machine learning life cycle . However, as data science projects enter mainstream organizations, a more comprehensive approach is needed. Some good starting points include:
- What is the Data Science Life Cycle?
- What is a Data Science Process?
Last Updated on January 31, 2023
The stages of mining: 5 lifecycle processes explained
When looking at mining stocks, it's easy to only focus on the finished product that you are investing your money in. Whether that's uranium, gold, silver, palladium or any other natural resource, it is necessary to understand the full extraction process in order to really appreciate the asset.
Just like no two diamonds are the same, neither are two mining projects.
Every billion-dollar project varies in some way (location, commodity, size) but there are 5 key stages that all miners follow that form the backbone of mine development.
The 5 Lifecycle Stages of Mining
1. Exploration & Prospecting Stage
This is the first and most essential step of the mining process: to open a mine, companies must first find an economically sufficient deposit (an amount of ore or mineral that makes exploitation worthwhile).
Geologists are enlisted by the companies to understand the characteristics of the land to identify the presence of mineral deposits.
What is a geologist?
A geologist studies the solid, liquid, and gaseous matter of the Earth as well as the processes that shape them. A mining geologist is responsible for mapping out the locations of valuable minerals and will use aerial photographs, field maps, and geophysical surveys, to determine where valuable materials are and estimate how much of those materials are in that location.
Exploration geologists search for mineral resources and get involved in the planning and expansion of mining operations. They locate and evaluate potential deposits of precious metals, industrial minerals, gemstones, pigments, construction materials or other minable commodities.
What mining techniques are used by geologists?
Geological surface mapping and sampling
A geologist will record all geological information from the rocks that outcrop at the surface, looking for boundaries between different rock types and structures, fault lines, and evidence of the rocks undergoing deformation. The geologist will also look for ore minerals and evidence of metal-rich fluids passing through the rock, recording mineralised veins and their distribution.
Mining companies need to target and prioritise their drilling activity so will use this data to target more specific areas where rock and mineral sampling might be appropriate. High-resolution geological mapping can also delineate areas of likely mineralisation which will lead to potential deposits.
Geophysical measurements are taken for mineral exploration to collect information about the physical properties of rocks and sediments. Geophysical companies employ the use of magnetic, radiometric, electromagnetic and gravity surveys to detect responses which may indicate the presence of mineral deposits.
Exploration geophysics is used to detect the type of mineralisation, by measuring its physical properties. It is used to map the subsurface structure of a region, to understand the underlying structures, the spatial distribution of rock units, and to detect structures such as faults, folds and intrusive rocks.
A chemical analysis that determines the proportion of metallic or non-metallic presence in a sample is called an assay. A wide variety of geological materials can be chemically analysed which include water, vegetation, soil, sediment and rock.
Assay labs can provide single and multi-element analyses by a variety of methods. Rock and soil samples are crushed, powdered, fused or digested in acid and then analysed using several different analytical methods and instruments.
Water, oil and soil tests
Most metallic ore deposits are formed through the interaction of an aqueous fluid and host rocks. Baseline samples are taken to determine hydrologic conditions and natural occurrences of potentially toxic elements in rocks, soils, and waters.
Surface geochemical analysis examines soil, rock, water, vegetation, and vapour for trace amounts of metals or other elements that may indicate the presence of a buried ore deposit. Geochemical techniques have played a key role in the discovery of numerous mineral deposits, and they continue to be a standard method of exploration.
Rock, water, soil and vegetation samples collected by prospectors and geoscientists can either be tested on-site or in laboratories called assay labs.
Airborne or ground geophysical surveys
Through either ground or airborne methods, geophysical companies undertake magnetic, radiometric and electromagnetic surveys to detect a response which may indicate potential deposits of mineral resources.
Airborne geophysical surveys are used for mineral exploration for mapping exposed bedrock, geological structures, sub-surface conductors, paleochannels, mineral deposits and salinity. There are several airborne geophysical methods used for minerals exploration including aeromagnetics, radiometrics and VTEM. A digital elevation model (DEM) is also used as an addition to most airborne geophysical surveys. Gravity surveys can also be conducted from the air as well as from the ground.
Ground-based geophysical surveys are implemented once mining companies have identified potential deposits at a regional scale and are performed from the soil surface, through boreholes, excavations or in a combination of placing sources and detectors.
Mineral exploration involves drilling to probe the contents of known ore deposits and potential sites to produce rock chips and samples of the core.
Drilling is used in areas that have been identified as targets with potential deposits based on geological, geophysical and geochemical surveys which have led to the design of the drilling programme. The aim is to obtain detailed information about rock types, mineral content, rock fabric, and the relationship between the rock layers close to the surface and at depth.
Samples from the orebody are taken to the lab, where geologists can analyse the core by chemical assay and conduct petrologic, structural, and mineralogical studies of the rock.
Exploration objectives are to find the ore and the drilling and sampling will provide the information upon which to base estimates of its quantity and grade.
Estimates of ore grade are based on the assays of samples obtained from drill holes into the ore. The accuracy of the estimates will depend on the care taken in procuring the samples and the judgment used in deciding on sample interval required, the accuracy in assaying, and the proper weighting of the individual assays in combining them for determining average grades of individual ore blocks, especially the treatment of erratic high values.
Valuable minerals are distributed unevenly and are present in varying degrees of purity throughout the material so that assays of individual samples may vary widely throughout sampling.
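The weighting described above is, at its simplest, a length-weighted average of the drill-hole assays, so that a short erratic high value does not dominate the block grade. The interval lengths and grades below are invented for illustration:

```python
# Length-weighted average grade for one ore block (hypothetical assay data).
# Each sample is (interval length in metres, grade in grams per tonne).
samples = [(1.5, 2.0), (2.0, 3.5), (0.5, 12.0)]  # last one is an erratic high value

total_length = sum(length for length, _ in samples)
# Weight each assay by the length of core it represents.
weighted_grade = sum(length * grade for length, grade in samples) / total_length
print(f"average grade: {weighted_grade:.2f} g/t")  # prints average grade: 4.00 g/t
```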
Companies must also take into account the socio-economic effects that the presence of a new mine could have on the area and surrounding communities.
Mining activities, including prospecting, exploration, construction, operation, maintenance, expansion, abandonment, decommissioning and repurposing of a mine can impact social and environmental systems in a range of positive and negative ways. Mining companies need to integrate environmental and social impact assessments into mining projects.
These assessments are the process of determining, analysing and evaluating the potential environmental and social impacts of a mining project, and designing appropriate implementation and management plans for the mining life cycle.
At the end of the exploration stage, miners are able to draw up a preliminary outline of the potential size of the deposits found using 2D or 3D models of the geological ore. An orebody model serves as the geological basis of all resource estimation and starts with a review of existing drill hole and surface or underground sample data as well as maps and plans with current geological interpretation.
2. Discovery Stage
Mine-site Design & Planning
Once the miners are sufficiently confident that there is a financially viable amount of deposit, the project can progress to the planning stage.
Companies will create multiple plans with different variables (time-span, amount of ore mined) to evaluate which fulfils the most criteria.
Planning criteria & permit considerations:
From exploration to mining of mineral resources, it is vital to ensure that critical safety and operational risks are considered in designing a mine. The mine plan should allow the miners to work in the safest way possible.
The safety and wellbeing of employees, contractors and local communities is a big concern for responsible mining companies and a mine plan will look at any aspect of mine operations that could have a direct impact on the wellbeing of workers, contractors and communities.
The mine plan needs to be designed to keep the damage to the environment to a minimum using strategies that can reduce environmental impact. Lower impact mining techniques will reduce interference at the mining site. Mining waste such as tailings, rocks and wastewater can be reused on or off-site.
Eco-friendly equipment such as electric engines which will result in big carbon savings and longer lasting equipment will cut down on waste over time.
Many former mine-sites are left unusable by landowners once the mine life has come to an end. Mine companies can employ land rehabilitation techniques such as topsoil replenishment and reforestation schemes to make the land productive again and speed up the land’s natural recovery process.
Illegal mining is a significant issue for the industry so preventing illegal or unregulated mining operations will help ensure that all mining is bound by the same environmental standards and ensure accountability.
Mine development starts when a deposit is discovered and continues through to the start of construction. The technical feasibility and the economic viability of each project are determined during the phases of mine development, with more detailed engineering data required at each stage.
- The Preliminary Economic Assessment (PEA) is an early level study and the preliminary evaluation of the mining project. A PEA is useful to determine if subsequent exploration activities and engineering studies are warranted. However, it is not valid for economic decision making or for reserve reporting.
- The Pre-Feasibility Study (PFS) is an intermediate step in the engineering process to evaluate the technical and economic viability of a mining project. The pre-feasibility study is a critical step for project development as it represents the minimum prerequisite for conversion of a geologic resource into a reportable reserve.
- A Feasibility Study (FS) represents the next and most detailed step in the engineering process for evaluating a mining project and is a comprehensive technical and economic study of the development.
- A Bankable Feasibility Study (BFS) also known as a definitive feasibility study (DFS) is the final piece of the financing puzzle. The results of the study serve as the basis for a final decision whether to proceed with the mine plans. It would be unusual for a company to get finance in place without one.
Corporate social responsibility
Social responsibility is very important in the world of mining and companies are finding it beneficial to strengthen their corporate social responsibility (CSR) efforts and find ways to give back to the surrounding community.
Mines often employ a large percentage of the local residents as their workforce and some companies get involved by financing local suppliers and so promoting local trade and growing the local economy. They also fund shared infrastructure in power distribution, roads, and water treatment and distribution.
Other companies become involved in local communities by supporting climate change programmes and environmental stewardship and wildlife projects, contributing to local and regional programmes including sponsorship of educational and sporting events, local medical facilities and the funding of local children’s schemes and arts festivals.
Companies aim to employ local labour and trades people wherever possible and focus on educational, health and infrastructure improvements that will have the greatest impact on the quality of life.
3. Development Stage
Once the plan has been confirmed, the real work can begin. This is the longest stage of the process so far, and can take anywhere from 10-20 years before the mine is ready for production, depending on the site size.
Does a mine's size affect the amount of ore produced?
Measuring mine productivity can be difficult given how unique each operation is.
Mines set their production goals but productivity at some mines is restricted by location. Mines are trying to minimize operating expenditure while continuing to increase productivity.
What does construction involve?
The construction of roads, rail, air-strips or ports to access the mine plus the services such as water, sewage and power is similar to the work required for establishing other types of industries except that this construction could be in remote areas with added logistical challenges.
Mining roads are a critical component of mining infrastructure and the performance of these roads has a direct impact on operational efficiency, costs and safety. A significant proportion of a mine’s cost is associated with material haulage and well-designed and managed roads contribute directly to reductions in cycle times, fuel burn, tyre costs and overall cost per tonne hauled and critically, underpin a safe transport system.
Development of the mine itself is different for an open pit to an underground mine and will require different experience and equipment. Porphyry deposits are often large and many of the deposits are near the surface and mined as open pits with large mining equipment; however, at depth some may have suitable characteristics to convert to large underground block caving mines. Vein type deposits are often narrow, can go to depth and are mined by underground methods with smaller equipment.
Once the mineral is extracted from a mine, it is processed and the processing operation depends on which material is excavated. The crushing and processing facility is constructed based on the testing, flow sheet and design determined in the FS. Processing of the ore starts with understanding the mineralogy and the metallurgical testing for crushing, grinding and recovery of the metals and treatment/management of the tailings.
Environmental management systems
Environmental aspects are included on the FS which has determined the current environmental habitat and the long-term impact of building the mine. The FS will also have determined the quantity and quality of all ore and waste to be mined plus tailings, the potential to generate acid and other deleterious metals plus how to treat these issues while operating and at closure.
Also included in the FS is the amount and quantity of water that will be used during operation and whether the water will need long-term treatment. Some countries require the FS as the basis for submitting plans for required mining permits.
An Environmental Management System (EMS) is part of the management system and includes organizational procedures, environmental responsibilities, and processes and will help the mining company comply with environmental regulations, identify technical and economic benefits, and ensure that corporate environmental policies are adopted and followed.
Mining companies with economical and technological flexibility have implemented comprehensive EMSs at current sites but these require input from governments, international environmental organizations, educational facilities, and the companies themselves.
Mine planning includes decisions on workforce accommodation, which affect not only employee quality of life but also the impacts on, and relationships with, existing local communities. Workforce accommodation is usually either community-based (purpose-built company towns or housing integrated within existing local communities) or commuter (fly-in, fly-out) mine camps, depending on the location of the mine and how remote it is.
The quality of accommodation underpins the fulfilment, morale and motivation of employees. This is not only relevant to productivity and safety, but also to recruitment and retention, particularly with the significant human resources crisis. If communities exist close to a proposed mine then the accommodation strategy can influence the value-adding potential for the sustainable development of such communities.
Where mine locations are isolated in remote areas and/or face significant economic, social and political adversity, the decisions on employee housing are more challenging. The mine company will need to understand the complexity of local planning issues and consider environmental, social, economic and political implications, together with the proposed accommodation strategy.
- Maintenance facilities - location for service and repair of mine equipment to reduce downtime and ensure that production capacity and safety objectives are met.
- Management offices, workshops, storage, refuelling, and power generation facilities. In some cases, a control tower may also be constructed to offer a complete view of processing operations.
- Transportation for mine personnel - Miners, contractors, and supervisors need to move between work areas which may be spread across a wide area.
4. Production Stage
Now the mine is finally ready to begin producing.
What are the two common methods of mining?
Surface mining
Surface mining is a broad category of mining in which the soil and rock overlying the mineral deposit is removed. It has been estimated that more than two-thirds of the world’s yearly mineral production is extracted by surface mining.
Surface mining is the preference for mining companies because removing the terrain surface to access the mineral beneath is often more cost-effective than digging tunnels and shafts to access mineral resources underground.
Surface mining methods:
- Strip Mining involves stripping the surface away from the mineral that’s being excavated (usually coal). The soil, rock, and vegetation over the mineral seam are removed with huge machines, including bucket-wheel excavators.
- Open-Pit Mining is a technique of extracting rock or minerals from the earth by their removal from an open-air pit. Open-pits are sometimes called ‘quarries’ when they produce building materials and dimension stone.
- Mountaintop Removal Mining retrieves minerals from mountain peaks by blasting away the overburden above the mineral seam with explosives. The broken mountaintop material is then shifted into the valleys and fills below.
- Dredging is a more sophisticated version of panning for gold: a scoop lifts material onto a conveyor belt, the mineral is removed, and the unwanted material is returned to the water.
- Highwall mining collects ores from a “highwall” with overburden and exposed minerals and ores.
Underground mining
Underground mining is used to access ores and valuable minerals by digging into the ground to extract them. There are several underground mining techniques used to excavate hard minerals, usually ores containing metals such as gold, silver, iron, copper, zinc, nickel, tin, and lead, but also ores of gems such as diamonds and rubies.
Underground mining methods:
- Ore is natural rock that contains valuable minerals, typically metals, and is extracted from the earth through mining. The grade of ore refers to the concentration of the valuable material it contains.
- Subsurface mining involves digging tunnels or shafts into the earth to reach buried ore deposits. Ore and waste rock are brought to the surface through the tunnels and shafts.
- The recovered minerals are processed using large crushers, mills, reactors, roasters and other equipment to consolidate the mineral-rich material and extract the desired compounds and metals from the ore.
- The ore is separated from the waste rock and crushed, and the minerals are then extracted from it by:
- Heap Leaching - the addition of chemicals, such as cyanide or acid solutions, that percolate through the crushed ore and dissolve out the valuable metal.
- Flotation - addition of a compound that attaches to the valuable mineral and floats.
- Smelting facilities - roasting the rock at temperatures greater than 900°C, which causes it to segregate into layers. The valuable minerals are then extracted.
- Once the mineral is extracted, it is often processed to separate the valuable metal from its ore through chemical or mechanical means, depending on the mineral resource present.
- The ore is then poured into moulds to create bars of bullion (metal formed into bars or ingots) ready for sale.
5. Reclamation Stage
Before a company can be issued a permit to build a mine, it must first prove that it has the funds and plans to close the mine in a safe and structured way.
Mining is a temporary activity: once the deposit is exhausted, it is time to relocate to a new site. But before that can happen, the mine must first be closed and rehabilitated.
What needs to happen before a mine can close?
The final step in mining operations is closure and reclamation. Mining companies have to think about a mine closure plan before they start to build, as governments need assurance that operators have a plan and the required funds to close the mine before they are willing to issue permits.
Detailed environmental studies on how the mine site will be closed and rehabilitated form a big part of the mine closure plan. A comprehensive mine rehabilitation programme will also include:
Ensuring public health and safety
There are many dangers at abandoned mines, many of which are not visible from the outside, including horizontal openings, vertical shafts, explosives and toxic chemicals, dangerous gases, deep water, spoil piles, unsafe abandoned buildings, and high walls. Mine companies need to ensure mines are fully closed and sealed to make them safe for the public.
Removing waste and hazardous material
A high volume of waste material originates from the excavation, dressing, and further physical and chemical processing of metalliferous and non-metalliferous minerals, and mine companies need to remove waste and hazardous material from the site both during operation and at closure of the mine.
Establishing new landforms and vegetation
Reclamation of mined areas involves the re-establishment of viable soils and vegetation at a mine site. For example, a simple approach could add lime or other materials that will neutralize acidity plus a cover of topsoil to promote vegetation growth. Modifying slopes and planting vegetation will stabilise the soil and prevent erosion.
Minimising environmental effects
A landscape affected by mining can take a long time to rehabilitate, and mine companies need to minimise environmental effects during the mine's life and mitigate the impacts of mining from the discovery phase through to closure:
Preserving water quality
The initial closure plan usually focuses on water quality: where the water will go after closure, and the quantity of water that will either discharge or migrate into the groundwater system after flooding.
Mining companies must find ways to protect groundwater and surface water resources, understand the risks related to water quantity and quality, and develop appropriate engineering controls and reclamation measures.
Stabilising land to protect against erosion
Reducing slopes through infill and reclamation, and growing plants and trees on mined areas, will stabilise the soil and reduce erosion by binding the soil and protecting the ground. Good erosion control helps keep valuable soils on the land and allows natural growth and regeneration.
Mine closure plans can aim to renovate the site to varying degrees:
- Cleaning up the contaminated area: removing all mine wastes, including water, treating the water, and isolating contaminated material.
- Stabilising the terrain: infill, landscaping, and topsoil replacement to make the land useful once again.
- Rebuilding any part of the ecosystem disturbed by the mine, such as flora and fauna, by planting trees and vegetation native to the area to allow regeneration.
- Rehabilitating the site to a stable, self-rejuvenating state, either as it was before the mine was built or as a new equivalent ecosystem that takes local environmental conditions into account. Mines can be repurposed for other uses such as agriculture, solar farms, biofuel production, or even recreation and tourism.
Mine closure process:
1. Shut-down
Production stops and the workforce is reduced. Some skilled workers are retained to permanently shut down the mine; re-training or early-retirement options are sometimes provided.
2. Decommissioning
The mine is decommissioned by workers or contractors who take apart the processing facilities and equipment, which are cleaned to be stored or sold. Buildings are repurposed or demolished, warehouse materials are recovered, and waste is disposed of.
3. Remediation and reclamation
The land and watercourses are reclaimed to a good standard to ensure any landforms and structures are stable and watercourses are of acceptable water quality. Hazardous materials are removed, and the land is reshaped and restored by adding topsoil and planting native grasses, trees, or ground cover.
4. Post-closure monitoring
It is important to assess the reclamation programme after closure and to identify any further actions required. Mines may require long-term care and maintenance, such as ongoing treatment of mine discharge water, periodic monitoring and maintenance of tailings containment structures, and monitoring of any remediation technologies used, such as constructed wetlands.
What happens to a mine once it’s closed?
Post-mining land use is an important issue in mine lifecycle planning, and there are many extraordinary examples of how mine sites can be repurposed, from underground bike parks to luxury hotels.
Now that you understand how a mine works, it's time to decide how you want to invest in mining stocks. Major companies? Junior companies? Gold? Silver? Uranium? The list is endless. This is a good starting point: Complete Guide: How to Invest in Mining Stocks (New 2021)
Looking for even more?
That's where we come in. Crux Investor is an investing app for busy people.
You’ll receive a single stock recommendation each month, curated by industry experts and presented in a clear and focused one-page memo. You’ll also receive access to a platform full of programmes that will let you grow your financial knowledge, all at your own pace.
Crux Investor is for anyone interested in saving time while investing with confidence. It's an ideal resource for the novice who needs guidance and is tired of throwing money away on guesses and gambles. But it's also a perfect fit for the experienced investor who wants a faster, more efficient way to arrive at the perfect stock or significantly increase their knowledge.
Finally, you can afford the analysts the big funds use. No more gambling, no more guesswork. Instead, save time, slay stress, and start investing with confidence by joining Crux Investor today.
Data Mining Project
Theoretical Considerations for Data Mining
Robert Nisbet Ph.D. , ... Ken Yale D.D.S., J.D. , in Handbook of Statistical Analysis and Data Mining Applications (Second Edition) , 2018
General Requirements for Success in a Data Mining Project
Following are general requirements for the success of a data mining project :
- Results will identify “low-hanging fruit,” as in a customer acquisition model where analytic techniques haven't been tried before (and anything rational will work better).
- Improved results can be highly leveraged; that is, an incremental improvement in a vital process will have a strong bottom-line impact. For instance, reducing “charge-offs” in credit scoring from 10% to 9.8% could make a difference of millions of dollars.
- A team skilled in each required activity. For all but very small projects, it is unlikely that one person will be sufficiently skilled in all activities; even if so, one person will not have the time to do it all, including data extraction, data integration, analytic modeling, and report generation and presentation. More importantly, the analytic and business people must cooperate closely so that analytic expertise can build on the existing domain and process knowledge.
- Data vigilance: capture and maintain the accumulating information stream (e.g., model results from a series of marketing campaigns).
- Time: learning occurs over multiple cycles. Early models can be improved by performing error analyses, which can point to changes in the data preparation and modeling methodology that improve future models. Champion-challenger tests with multiple algorithms can also produce models with enhanced predictability. Successive iterations of model enhancement can generate successive increases in success.
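A champion-challenger comparison of the kind mentioned above can be sketched in a few lines of Python. The two "models" and the synthetic data here are invented for illustration: a mean predictor defends as champion, and a nearest-neighbour predictor challenges it on held-out data.

```python
import random

def mean_model(train):
    """Champion: always predict the training mean."""
    mean = sum(y for _, y in train) / len(train)
    return lambda x: mean

def nearest_neighbour_model(train):
    """Challenger: predict the target of the closest training point."""
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]

def mean_absolute_error(model, data):
    return sum(abs(model(x) - y) for x, y in data) / len(data)

def champion_challenger(train, holdout, champion_fn, challenger_fn):
    """Promote the challenger only if it beats the champion on held-out data."""
    champion = champion_fn(train)
    challenger = challenger_fn(train)
    if mean_absolute_error(challenger, holdout) < mean_absolute_error(champion, holdout):
        return "challenger"
    return "champion"

# Synthetic data with a clear x-to-y relationship, so the neighbour model should win.
random.seed(42)
data = [(x, 2 * x + random.uniform(-1, 1)) for x in range(100)]
random.shuffle(data)
train, holdout = data[:70], data[70:]
winner = champion_challenger(train, holdout, mean_model, nearest_neighbour_model)
print(winner)
```

In practice the candidates would be full modeling algorithms, but the promotion rule, comparing candidates on the same held-out data and keeping the winner, is the same.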
Each of these types of data mining applications followed a common methodology in principle. We will expand on the subject of the data mining process in Chapter 3 .
David Nettleton , in Commercial Data Mining , 2014
Evaluation of Viability in Terms of Available Data – Specific Considerations
The following list provides specific considerations for evaluating the viability of a data mining project in terms of the available data:
Does the necessary data for the business objectives exist, and does the business have access to it?
If part or all of the data does not exist, can processes be defined to capture or obtain it?
What is the coverage of the data with respect to the business objectives?
What is the availability of a sufficient volume of data over a required period of time, for all clients, product types, sales channels, and so on? (The data should cover all the business factors to be analyzed and modeled. The historical data should cover the current business cycle.)
Is it necessary to evaluate the quality of the available data in terms of reliability? (The reliability depends on the percentage of erroneous data and incomplete or missing data. The ranges of values must be sufficiently wide to cover all cases of interest.)
Are people available who are familiar with the relevant data and the operational processes that generate the data?
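Several of these checks can be partially automated. The following is a minimal data-audit sketch, assuming records arrive as dictionaries; the column names and the missing-data threshold are illustrative, not prescribed by the text.

```python
# Toy records with deliberate gaps (None marks missing data).
records = [
    {"age": 34, "income": 52000, "channel": "web"},
    {"age": None, "income": 48000, "channel": "store"},
    {"age": 29, "income": None, "channel": "web"},
    {"age": 41, "income": 61000, "channel": None},
]

def audit(records, max_missing=0.25):
    """Report, per column: the missing-data rate, whether it passes the
    reliability threshold, and the observed range of values."""
    report = {}
    for col in records[0].keys():
        values = [r[col] for r in records]
        missing = sum(v is None for v in values) / len(values)
        present = [v for v in values if v is not None]
        numeric = all(isinstance(v, (int, float)) for v in present)
        report[col] = {
            "missing_rate": missing,
            "ok": missing <= max_missing,
            # Numeric columns get a (min, max) range; categorical ones a value list.
            "range": (min(present), max(present)) if numeric else sorted(set(present)),
        }
    return report

report = audit(records)
print(report["age"])
```

A report like this answers the reliability and range-coverage questions above before any modeling begins.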
This chapter has given a brief introduction to how the results of data mining projects (analysis and modeling) can be deployed in the business environment. Simple but effective options such as query/reporting and EIS have been discussed, as have more complex options such as expert systems and case-based systems. The option chosen depends on the type of business, how it is run, and the specific data needs for decision-making. If a simple report of sales leads, each with an associated probability of acceptance, does the job, then there is no need to install a complex expert system. However, even query/reporting and EIS are not necessarily plug-and-play applications; they usually need customizing and some technical support to make them do what the user wants.
Finally, the following is a brief explanation of a recent news story that serves as a cautionary tale about the preparation of statistical summaries and their usage. In April 2013, world economists were surprised and perplexed by the news that two prestigious Harvard academics had made a basic error when using an Excel spreadsheet to summarize their findings and publish an influential economic model. This model has been used in recent years (circa 2013) by economists around the world to support the economic arguments that countries should cut spending to promote economic growth.
The elemental error was in the definition of an average function cell that was based on the range of values in a given column of data. Each row showed the GDP growth for a given country when the country’s debt-to-GDP ratio was 90 percent or more. The conclusion from the data (which included the error) was that, were a country’s debt-to-GDP ratio to go over 90 percent (the critical threshold), economic growth would drop off sharply. However, the range defined in the average function missed out the last five cells. (These cells included data from the countries of Denmark, Canada, Belgium, Austria, and Australia). If these countries had been included, the average GDP growth would be 2.2 percent instead of −0.1 percent! The evaluations of the findings and of this error are still being debated.
Apart from the error itself, another criticism is the lack of controls that would stop such an error from passing unchecked and becoming common wisdom. However, aside from the implications of this Excel error for world economic policy, in the context of presenting key business information derived from data mining (the theme of this chapter), the lesson is that a report and the data it is based on should be double-checked: not by the same person, but by a peer, colleague, manager, or subordinate who can independently debug any possible faulty calculations or fundamental assumptions.
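This class of error is easy to reproduce. In the sketch below the growth figures are invented for illustration (they are not the actual study data), but the mechanism is the same: an aggregation range that stops a few rows early can flip the sign of the summary statistic.

```python
# Hypothetical GDP-growth figures, one per country (illustrative only).
growth = [-3.0, -2.5, -1.8, 0.4, 1.2, 2.0, 2.5, 3.1, 3.6, 4.0]

def average(values):
    return sum(values) / len(values)

truncated = average(growth[:-5])   # the spreadsheet range stops five rows early
complete = average(growth)         # the range covers every row

print(round(truncated, 2), round(complete, 2))
```

With these invented numbers the truncated average is negative while the complete average is positive, which is exactly the kind of sign flip an independent reviewer should catch.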
Accessory Tools for Doing Data Mining
Before moving into a discussion of the proper algorithms to use for a data mining project , we must take a side trip to help you understand that modeling algorithms are just one set of data mining tools you will use to complete a data mining project. The practice of data mining includes the use of a number of techniques that have been developed to serve as a set of tools in the data miner's toolbox. In the early days of data mining, many of these tools had to be built (usually in SQL or Perl) and used in an ad hoc fashion for every job. Many of these functions have been included as separate objects in data mining packages or “productized” separately. Most jobs will require the data miner to become proficient in even those tools that are not included in a given data mining package. The following tools can help the data miner:
Data access tools : SQL and other database query languages
Data integration tools : extract-transform-load (ETL) tools to access, modify, and load data from different structures and formats into a common output format (e.g., database and flat file)
Data exploration tools : basic descriptive statistics, particularly frequency tables; slicing, dicing, and drill downs
Model management tools : data mining workspace libraries, templates, and projects
Modeling analysis tools : feature selection; model evaluation tools. ( Note: This topic will be expanded in Chapter 11 .)
Miscellaneous tools : in-place data processing (IDP) tools, rapid deployment tools, and model monitoring tools
Being able to use these tools properly can be very helpful in identifying significant variables and facilitating the rapid decision-making necessary to compete successfully in the global marketplace.
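As a hypothetical illustration of the exploration tools above, a frequency table plus a slice-and-drill-down can be built with nothing more than the standard library (the field names and records are invented):

```python
from collections import Counter

# Toy transaction records.
rows = [
    {"region": "north", "product": "A", "amount": 120},
    {"region": "north", "product": "B", "amount": 80},
    {"region": "south", "product": "A", "amount": 200},
    {"region": "south", "product": "A", "amount": 150},
]

# Frequency table: how often each region appears.
freq = Counter(r["region"] for r in rows)

# "Slice": restrict to one region, then "drill down" by product within it.
north = [r for r in rows if r["region"] == "north"]
by_product = Counter(r["product"] for r in north)

print(freq["south"], by_product["A"])
```

Dedicated data mining packages wrap the same operations in graphical form, but the underlying queries are this simple.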
Tim Menzies , ... Burak Turhan , in Sharing Data and Models in Software Engineering , 2015
In the scout phase, rapid prototyping is used to try many mining methods on the data. In this phase, experimental rigor is less important than exploring the range of user hypotheses. The other goal of this phase is to gain the interest of the users in the induction results.
It is important to stress that feedback to the users can and must appear very early in a data mining project . We find that users find it very hard to express what they want from their data, especially if they have never mined it before. However, once we start showing them results, their requirements rapidly mature, as initial results help them sharpen the focus of the inductive study. Therefore, we recommend:
Simplicity first. Prior to conducting very elaborate studies, try applying very simple tools to gain rapid early feedback.
For example, simple linear-time column pruners, such as those discussed in the last chapter, comment on which factors are not influential in a particular domain. It can be insightful to discuss this information with the users.
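A linear-time column pruner of the kind mentioned above might look like the following sketch. The pruning rule used here, dropping near-constant columns, is one simple choice among many; the table and its field names are invented for illustration.

```python
def prune_columns(table, min_distinct=2):
    """Single pass per column: flag columns whose value never (or barely)
    varies, since a near-constant factor cannot influence the outcome."""
    keep, drop = [], []
    for col in table[0].keys():
        distinct = {row[col] for row in table}
        (keep if len(distinct) >= min_distinct else drop).append(col)
    return keep, drop

# Toy software-project table: every row has the same language, so that
# column carries no signal for this dataset.
table = [
    {"lang": "java", "loc": 1200, "defects": 3},
    {"lang": "java", "loc": 400, "defects": 1},
    {"lang": "java", "loc": 900, "defects": 2},
]
keep, drop = prune_columns(table)
print(keep, drop)
```

Showing users a list like `drop` ("these factors cannot matter here") is exactly the kind of rapid early feedback the text recommends.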
1.4 How to read this book
This book covers the following material:
Part I: Data Mining for Managers : The success of an industrial data mining project depends on technical matters as well as some very important organizational matters. This first section describes those organizational issues.
Part II: Data Mining: A Technical Tutorial : Discusses data mining for software engineering (SE) applications; several data mining methods that form the building blocks for advanced data science approaches for software engineering. For example, in this book, we apply those methods to numerous applications of data mining for SE, including software effort estimation and defect prediction.
Part III: Sharing Data : In this part, we discuss methods for moving data across organizational boundaries. The topics covered here include how to find learning contexts and then how to learn across contexts (for cross-company learning); how to handle missing data; privacy; and active learning.
Part IV: Sharing Models : In this part, we discuss how to take models learned from one project and adapt and apply them to others. Topics covered here include ensemble learning; temporal learning; and multiobjective optimization.
The chapters of Parts I and II document a flow of ideas while the chapters of Parts III and IV were written to be mostly self-contained. Hence, for the reader who likes skimming, we would suggest reading all of Parts I and II (which are quite short) then dipping into any of the chapters in Parts III and IV, according to your own interests.
To assist in finding parts of the book that most interest you, this book contains several roadmaps :
See Chapter 2 for a roadmap to Part I: Data Mining for Managers .
See the start of Chapter 7 for a roadmap to Part II: Data Mining: A Technical Tutorial .
See Chapter 11 , Section 11.2 , for a roadmap to Part III: Sharing Data .
See Chapter 19 for a roadmap to Part IV: Sharing Models .
4.1 Data analysis patterns
As another guide to readers, from Chapter 12 onwards each chapter starts with a short summary table that we call a data analysis pattern .
Seven principles of inductive software engineering
T. Menzies , in Perspectives on Data Science for Software Engineering , 2016
Principle #4: Be Open Minded
The goal of inductive engineering for SE is to find better ideas than what was available when you started. So if you leave a data mining project with the same beliefs as when you started, you really wasted a lot of time and effort. Hence, some mantras to chant while data mining are:
Avoid a fixed hypothesis. Be respectful but doubtful of all human-suggested domain hypotheses. Certainly, explore the issues that they raise, but also take the time to look further afield.
Avoid a fixed approach to data mining (e.g., just using decision trees all the time), particularly for data that has not been mined before.
The most important initial results are the ones that radically and dramatically improve the goals of the project. So seek important results.
Incorporating Various Sources of Data and Information
This chapter discusses data sources that can be accessed for a commercial data analysis project. One way of enriching the information available about a business’s environment and activity is to fuse together various sources of information and data. The chapter begins with a discussion of internal data; that is, data about a business’s products, services, and customers, together with feedback on business activities from surveys, questionnaires, and loyalty and customer cards. The chapter then considers external data—which affects a business and its customers in various ambits—such as demographic and census data, macro-economic data, data about competitors, and data relating to stocks, shares, and investments. Examples are given for each source and where and how the data could be obtained.
Although some readers may be familiar with one or more of these data sources, they may need help selecting which to use for a given data mining project . Table 3.1 gives examples of which data sources are relevant for which business objectives and commercial data mining activities. Columns two through eight show the seven data sources described in this chapter, and the column labeled “Business Objectives” lists generic business examples. Each cell indicates whether a specific data source would be required for the given business objective.
Table 3.1. Business objectives versus data sources
Primary data sources.
The primary data sources include the data already in the basic data repository derived from a business’s products, services, customers, and transactions. That is, a data mining project could be considered that uses only this elemental data and no other sources. The primary data sources are indicated in the columns labeled “Internal” in Table 3.1 .
Each data mining project must evaluate and reach a consensus on which factors, and therefore which sources, are necessary for the business objective. For example, if the general business objective is to reduce customer attrition and churn (loss of clients to the competition), a factor related to customer satisfaction may be needed that is not currently in the database. Hence, in order to obtain this data, the business might design a questionnaire and launch a survey for its current customers. Defining the necessary data for a data mining project is a recurrent theme throughout Chapters 2 to 9, and the results of defining and identifying new factors may require a search for the corresponding data sources, if available, and/or obtaining the data via surveys, questionnaires, new data-capture processes, and so on. Demographic data about specific customers can be elicited from them using the surveys, questionnaires, and loyalty registration forms discussed in this chapter.
With reference to demographic data, we distinguish between the general anonymous type (such as that of the census) and specific data about identifiable customers (such as age, gender, marital status, and so on).
Methodologies for Knowledge Discovery Processes in Context of AstroGeoInformatics
Peter Butka PhD , ... Juliana Ivančáková MSc , in Knowledge Discovery in Big Data from Astronomy and Earth Observation , 2020
1.3.3 Proprietary Methodologies – Usage of Specific Tools
While the research or open standard methodologies are more general and tool-free, some of the leaders in the area of data analysis also provide to their customers proprietary solutions, usually based on the usage of their software tools.
One such example is the SEMMA methodology from the SAS Institute, which provides a process description of how to use its data mining tools. SEMMA is a list of steps that guide users in the implementation of a data mining project . While SEMMA still provides quite a general overview of the KDP, its authors claim it is the most logical organization of their tools for covering core data mining tasks (implemented in SAS Enterprise Miner). The main difference between SEMMA and the traditional KDD overview is that the first step of application-domain understanding (business understanding in CRISP-DM) is skipped. SEMMA also does not include the knowledge-application step, so the business aspect is out of scope for this methodology ( Azevedo and Santos, 2008 ). Both of these steps are considered crucial by the knowledge discovery community for the success of projects. Moreover, applying this methodology outside SAS software tools is not easy. The phases of SEMMA and their related tasks are the following:
Sample – the first step is data sampling: selection of the dataset and data partitioning for modeling. The dataset should be large enough to contain representative information and content, but still small enough to be processed efficiently.
Explore – understanding the data, performing exploration analysis, examining relations between the variables, and checking anomalies, all using simple statistics and mostly visualizations.
Modify – methods to select, create, and transform variables (attributes) in preparation for data modeling.
Model – the application of data mining techniques on the prepared variables, the creation of models with (possibly) the desired outcome.
Assess – the evaluation of the modeling results, and analysis of reliability and usefulness of the created models.
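The five SEMMA phases can be sketched as a toy Python pipeline. Every function here is a deliberately tiny stand-in for what a real tool such as SAS Enterprise Miner would do, and the noise-free data are invented for illustration:

```python
import random

def sample(dataset, n, seed=0):
    """Sample: a partition small enough to process, large enough to represent."""
    random.seed(seed)
    return random.sample(dataset, n)

def explore(rows):
    """Explore: simple descriptive statistics before any modelling."""
    xs = [x for x, _ in rows]
    return {"n": len(rows), "x_min": min(xs), "x_max": max(xs)}

def modify(rows):
    """Modify: derive a transformed variable in preparation for modelling."""
    return [(x, x * x, y) for x, y in rows]

def model(rows):
    """Model: fit y = a*x by least squares (a deliberately tiny model)."""
    a = sum(x * y for x, _, y in rows) / sum(x * x for x, _, y in rows)
    return lambda x: a * x

def assess(fitted, rows):
    """Assess: mean absolute error of the fitted model."""
    return sum(abs(fitted(x) - y) for x, _, y in rows) / len(rows)

data = [(x, 3 * x) for x in range(1, 51)]  # noise-free toy data: y = 3x
sampled = sample(data, 20)
stats = explore(sampled)
fitted = model(modify(sampled))
error = assess(fitted, modify(sampled))
print(stats["n"], error)
```

On noise-free data the fit is exact, so the assessment error is zero; real projects iterate these phases because it never is.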
IBM Analytics Services have designed a new methodology for data mining/predictive analytics named Analytics Solutions Unified Method for Data Mining/Predictive Analytics (also known as ASUM-DM), 3 which is a refined and extended CRISP-DM. While the strong points of CRISP-DM are on the analytical side, due to its open-standard nature it does not cover the infrastructure or operations side of implementing data mining projects; i.e., it has only a few project management activities and no templates or guidelines for such tasks.
The primary goal of ASUM-DM creation was to solve the disadvantages mentioned above. It means that this methodology retained CRISP-DM and augmented some of the substeps with missing activities, tasks, guidelines, and templates. Therefore, ASUM-DM is an extension or refinement of CRISP-DM, mainly in the more detailed formalization of steps and application of (IBM-based) analytics tools. ASUM-DM is available in two versions – an internal IBM version and an external version. The internal version is a full-scale version with attached assets, and the external version is a scaled-down version without attached assets. Some of these ASUM-DM assets or a modified version are available through a service engagement with IBM Analytics Services. Like SEMMA, it is a proprietary-based methodology, but more detailed and with a broad scope of covered steps within the analytical project.
At the end of this section, we also mention that KDPs can easily be extended using agile methods originally developed for software development. The main application of agile aspects is, logically, in larger teams in industry. Many approaches are adapted explicitly for a particular company and are therefore proprietary. Generally, KDP is iterative, and the inclusion of more agile aspects is quite natural ( Nascimento and de Oliveira, 2012 ). The AgileKDD method follows the OpenUP lifecycle, which implements the Agile Manifesto: the project consists of sprints with fixed deadlines (usually a few weeks), and each sprint must deliver incremental value. Another example of an agile process description is ASUM-DM from IBM, which combines project management and agility principles.
Process Models for Data Mining and Analysis
Colleen McCue , in Data Mining and Predictive Analysis , 2007
What the CIA model brings in terms of specificity to intelligence, and by extension applied public safety and security analysis, the CRISP-DM process model contributes to data mining as a process, which is reflected in its origins. Several years ago, representatives from a diverse array of industries gathered to define the best practices, or standard process, for data mining. 8 The result of this task was the CRoss-Industry Standard Process for Data Mining (CRISP-DM). The CRISP-DM process model was based on direct experience from data mining practitioners, rather than scientists or academics, and represents a “best practices” model for data mining that was intended to transcend professional domains. Data mining is as much analytical process as it is specific algorithms and models. Like the CIA Intelligence Process, the CRISP-DM process model has been broken down into six steps: business understanding, data understanding, data preparation, modeling, evaluation, and deployment. 9
Perhaps the most important phase of the data mining process is gaining an understanding of the current practices and overall objectives of the project. During the business understanding phase of the CRISP-DM process, the analyst determines the objectives of the data mining project . Included in this phase are an identification of the resources available and any associated constraints, the overall goals, and specific metrics that can be used to evaluate the success or failure of the project.
The second phase of the CRISP-DM analytical process is the data understanding step. During this phase, the data are collected and the analyst begins to explore and gain familiarity with the data, including form, content, and structure. Knowledge and understanding of the numeric features and properties of the data (e.g., categorical versus continuous data) will be important during the data preparation process and essential to the selection of appropriate statistical tools and algorithms used during the modeling phase. Finally, it is through this preliminary exploration that the analyst acquires an understanding of and familiarity with the data that will be used in subsequent steps to guide the analytical process, including any modeling, evaluate the results, and prepare the output and reports.
After the data have been examined and characterized in a preliminary fashion during the data understanding stage, the data are then prepared for subsequent mining and analysis. This data preparation includes any cleaning and recoding as well as the selection of any necessary training and test samples. It is also during this stage that any necessary merging or aggregating of data sets or elements is done. The goal of this step is the creation of the data set that will be used in the subsequent modeling phase of the process.
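The data preparation step described above, cleaning, recoding, and selecting training and test samples, can be sketched as follows. The field names, coding rules, and split ratio are illustrative assumptions, not part of CRISP-DM itself:

```python
import random

# Toy raw records: one missing value and one inconsistently coded label.
raw = [
    {"age": "34", "risk": "low"},
    {"age": None, "risk": "high"},
    {"age": "51", "risk": "HIGH"},
    {"age": "29", "risk": "low"},
]

def prepare(rows, train_fraction=0.75, seed=1):
    # Clean: drop records with missing values.
    clean = [r for r in rows if all(v is not None for v in r.values())]
    # Recode: cast numeric strings and normalise category labels.
    coded = [{"age": int(r["age"]), "risk": r["risk"].lower()} for r in clean]
    # Split: training and test samples for the modeling phase.
    random.seed(seed)
    random.shuffle(coded)
    cut = int(len(coded) * train_fraction)
    return coded[:cut], coded[cut:]

train, test = prepare(raw)
print(len(train), len(test))
```

The output of this step, a cleaned and partitioned data set, is exactly what the modeling phase consumes next.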
During the modeling phase of the project, specific modeling algorithms are selected and run on the data. Selection of the algorithms employed in the data mining process is based on the nature of the question and the outputs desired. These algorithms can be categorized into two general groups: rule induction models or decision trees, and unsupervised learning or clustering techniques. For example, scoring algorithms or decision tree models are used to create decision rules based on known categories or relationships, rules that can then be applied to unknown data; unsupervised learning or clustering techniques, by contrast, are used to uncover natural patterns or relationships in the data when group membership or category has not been identified previously. Additional considerations in model selection and creation include the need to balance accuracy against comprehensibility. Some extremely powerful models, although very accurate, can be very difficult to interpret and thus validate. On the other hand, models that generate output that can be understood and validated frequently compromise overall accuracy in order to achieve this.
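The contrast between the two groups can be shown with deliberately tiny stand-ins: a one-split "decision stump" as a minimal rule induction model, and a one-dimensional 2-means routine as a minimal clustering technique. All values and labels are invented for illustration.

```python
# Supervised rule induction: labeled numeric observations.
labeled = [(1, "low"), (2, "low"), (3, "low"),
           (8, "high"), (9, "high"), (10, "high")]

def learn_stump(data):
    """Pick the threshold that best separates the two known labels."""
    best = None
    for t, _ in data:
        correct = sum(1 for v, lab in data
                      if (lab == "high") == (v >= t))
        if best is None or correct > best[1]:
            best = (t, correct)
    return best[0]

threshold = learn_stump(labeled)

def apply_rule(x):
    """The learned rule, applicable to previously unseen values."""
    return "high" if x >= threshold else "low"

# Unsupervised clustering: 1-D 2-means, using no labels at all.
def two_means(values, iters=10):
    c1, c2 = min(values), max(values)
    for _ in range(iters):
        g1 = [v for v in values if abs(v - c1) <= abs(v - c2)]
        g2 = [v for v in values if abs(v - c1) > abs(v - c2)]
        c1, c2 = sum(g1) / len(g1), sum(g2) / len(g2)
    return sorted((c1, c2))

centers = two_means([v for v, _ in labeled])
```

The stump needs the known categories to learn its rule; the clustering routine recovers the same two natural groups from the values alone.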
During the evaluation phase of the project, the models created are reviewed to determine their accuracy as well as their ability to meet the goals and objectives of the project identified in the business understanding phase. Put simply: Is the model accurate, and does it answer the question posed?
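In code, that check reduces to scoring held-out cases and comparing the result against the success metric fixed during business understanding. The labels and the 0.75 target below are illustrative assumptions, not values from the text.

```python
# Evaluation sketch: compare model predictions on held-out cases
# against the known outcomes (all values are illustrative).
actual    = ["high", "low", "high", "high", "low"]
predicted = ["high", "low", "low",  "high", "low"]

# Proportion of cases the model classified correctly.
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)

# Does the model meet the success metric agreed on at project start?
meets_goal = accuracy >= 0.75
```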
Finally, the deployment phase includes the dissemination of the information. The form of the information can include tables and reports as well as the creation of rule sets or scoring algorithms that can be applied directly to other data.
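A rule set deployed this way can be packaged as a scoring function and applied directly to new, unscored records. The rule, fields, and records below are hypothetical, standing in for whatever the modeling phase actually produced.

```python
# Deployment sketch: express a learned rule set as a scoring function
# and apply it to new records (rule and field names are hypothetical).
def score(record):
    """A two-condition decision rule from the modeling phase."""
    if record["weapon"] and record["hour"] >= 22:
        return "high risk"
    return "low risk"

new_cases = [
    {"id": 1, "weapon": True,  "hour": 23},
    {"id": 2, "weapon": False, "hour": 14},
]

# The disseminated product: a scored report keyed by case ID.
report = {c["id"]: score(c) for c in new_cases}
```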
This model has worked very well for many business applications; 10 however, law enforcement, security, and intelligence analysis can differ in several meaningful ways. Analysts in these settings frequently encounter unique challenges associated with the data, including timely availability, reliability, and validity. Moreover, in almost all cases the output needs to be comprehensible and easily understood by nontechnical end users while remaining directly actionable in the applied setting. Finally, unlike in the business community, the cost of errors in the applied public safety setting frequently is life itself. Errors in judgment based on faulty analysis or interpretation of the results can put citizens as well as operational personnel at risk for serious injury or death.
Table 4-1 . Comparison of the CRISP-DM and CIA Intelligence Process Models.
The CIA Intelligence Process has unique features associated with its use in support of the intelligence community, including its ability to guide sound policy and information-based operational support. The importance of domain expertise is underscored in the intelligence community by the existence of specific agencies responsible for the collection, processing, and analysis of specific types of intelligence data. The CRISP-DM process model highlights the need for subject matter experts and domain expertise, but emphasizes a common analytical strategy that has been designed to transcend professional boundaries and that is relatively independent of content area or domain. The CIA Intelligence Process and CRISP-DM models are well suited to their respective professional domains; however, they are both somewhat limited in directly addressing the unique challenges and needs related to the direct application of data mining and predictive analytics in the public safety and security arena. Therefore, an integrated process model specific to public safety and security data mining and predictive analytics is outlined below. Like the CIA model, this model recognizes not only a role but also a critical need for analytical tradecraft in the process; and like the CRISP-DM process model, it emphasizes the fact that effective use of data mining and predictive analytics truly is an analytical process that encompasses far more than the mathematical algorithms and statistical techniques used in the modeling phase.