Here are the most frequently asked Data Mining interview questions and answers, for both freshers and experienced candidates.


 

1. What Is Data Mining?

Ans:

Data mining is the process of extracting hidden trends from a data warehouse. For example, an insurance data warehouse can be mined to find the highest-risk people to insure in a certain geographical area.

2. Differentiate Between Data Mining And Data Warehousing?

Ans:

Data warehousing is merely extracting data from different sources, cleaning the data, and storing it in the warehouse, whereas data mining aims to examine or explore that data using queries. These queries can be fired on the data warehouse. Exploring the data in data mining helps in reporting, planning strategies, finding meaningful patterns, etc.
E.g. a data warehouse of a company stores all the relevant information about projects and employees. Using data mining, one can use this data to generate different reports, such as profits generated.

3. What Is Data Purging?

Ans:

The process of cleaning junk data is termed data purging. Purging data means getting rid of unnecessary NULL values in columns. This is usually done when the size of the database grows too large.

4. What Are Cubes?

Ans:

A data cube stores data in a summarized form, which helps in faster analysis of the data. The data is stored in such a way that it allows easy reporting.
E.g. using a data cube, a user may want to analyze the weekly and monthly performance of an employee. Here, month and week could be considered the dimensions of the cube.
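The idea of summarizing facts along dimensions can be sketched in plain Python (the fact rows and dimension choices here are made up for illustration):

```python
from collections import defaultdict

# Hypothetical fact rows: (employee, month, week, hours_worked)
facts = [
    ("alice", "Jan", 1, 40), ("alice", "Jan", 2, 38),
    ("alice", "Feb", 1, 42), ("bob",   "Jan", 1, 35),
]

def summarize(rows, dims):
    """Aggregate the measure (last field) over the chosen dimension columns."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[d] for d in dims)
        totals[key] += row[-1]
    return dict(totals)

# Monthly view per employee (dimension columns: employee=0, month=1)
monthly = summarize(facts, dims=(0, 1))
print(monthly[("alice", "Jan")])  # 78
```

A real cube pre-computes such aggregates for many dimension combinations so that reports along month, week, etc. come back quickly.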

5. What Are Olap And Oltp?

Ans:

An IT system can be divided into analytical processing and transactional processing.
OLTP is characterized by short online transactions. The emphasis is on fast query processing and maintaining data integrity in a multi-access environment.
OLAP is characterized by low volumes of transactions. Queries involve aggregation and are very complex. Response time is an effectiveness measure, and OLAP is widely used in data mining techniques.

6. What Are The Different Problems That “data Mining” Can Solve?

Ans:

• Data mining helps analysts make faster business decisions, which increases revenue at lower cost.
• Data mining helps to understand, explore, and identify patterns in data.
• Data mining automates the process of finding predictive information in large databases.
• It helps to identify previously hidden patterns.

7. What Are Different Stages Of “data Mining”?

Ans:

Exploration: This stage involves the preparation and collection of data. It also involves data cleaning and transformation. Based on the size of the data, different tools to analyze it may be required. This stage helps to determine the different variables of the data and their behavior.
Model building and validation: This stage involves choosing the best model based on predictive performance. Each candidate model is applied to different data sets and compared for best performance. This stage is also called pattern identification. It is a little complex because it involves choosing the best pattern to allow easy predictions.
Deployment: The model selected in the previous stage is applied to new data sets to generate predictions or estimates of the expected outcome.

8. What Is Discrete And Continuous Data In Data Mining World?

Ans:

Discrete data can be considered defined or finite data, e.g. mobile numbers or gender. Continuous data changes continuously and in an ordered fashion, e.g. age.

9. What Is Model In Data Mining World?

Ans:

Models in Data mining help the different algorithms in decision making or pattern matching. The second stage of data mining involves considering various models and choosing the best one based on their predictive performance.

10. How Does The Data Mining And Data Warehousing Work Together?

Ans:

Data warehousing can be used for analyzing the business needs by storing data in a meaningful form. Using Data mining, one can forecast the business needs. Data warehouse can act as a source of this forecasting.



 

11. What Is A Decision Tree Algorithm?

Ans:

A decision tree is a tree in which every node is either a leaf node or a decision node. The tree takes an object as input and outputs some decision. Paths from the root node to a leaf node combine attribute tests using AND, OR, or both. The tree is constructed using the regularities in the data. A decision tree is not affected by automatic data preparation.
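A minimal sketch of evaluating such a tree in Python; the attributes, thresholds, and risk labels are invented for illustration:

```python
# Each decision node tests one attribute of the input object;
# each leaf node holds the output decision.
def classify(node, obj):
    """Walk from the root to a leaf, following the branch each test selects."""
    while "leaf" not in node:
        branch = "yes" if obj[node["attr"]] > node["threshold"] else "no"
        node = node[branch]
    return node["leaf"]

# A hand-built toy tree (real trees are induced from the data's regularities)
tree = {
    "attr": "age", "threshold": 30,
    "yes": {"attr": "income", "threshold": 50,
            "yes": {"leaf": "low risk"}, "no": {"leaf": "high risk"}},
    "no": {"leaf": "high risk"},
}

print(classify(tree, {"age": 45, "income": 60}))  # low risk
```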

12. What Is Naive Bayes Algorithm?

Ans:

The Naive Bayes algorithm is used to generate mining models. These models help identify relationships between the input columns and the predictable columns. The algorithm can be used in the initial stage of exploration. It calculates the probability of every state of each input column given each possible state of the predictable columns. After the model is built, the results can be used for exploration and for making predictions.
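The probability calculation can be sketched in a few lines of Python; the weather/decision data here is made up, with one input column and one predictable column:

```python
from collections import Counter, defaultdict

# Toy training cases: (input_value, predictable_class)
cases = [("rain", "stay"), ("rain", "stay"), ("sun", "go"),
         ("sun", "go"), ("sun", "stay")]

# Count how often each input state occurs with each predictable state.
class_counts = Counter(c for _, c in cases)
joint = defaultdict(Counter)
for value, cls in cases:
    joint[cls][value] += 1

def predict(value):
    """Pick the class maximizing P(class) * P(value | class)."""
    def score(cls):
        prior = class_counts[cls] / len(cases)
        likelihood = joint[cls][value] / class_counts[cls]
        return prior * likelihood
    return max(class_counts, key=score)

print(predict("rain"))  # stay
```

A production implementation would also smooth zero counts and handle many input columns by multiplying their likelihoods ("naively" assuming independence).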

13. Explain Clustering Algorithm?

Ans:

A clustering algorithm is used to group sets of data with similar characteristics, also called clusters. These clusters help in making faster decisions and exploring data. The algorithm first identifies relationships in a dataset and then generates a series of clusters based on those relationships. The process of creating clusters is iterative: the algorithm redefines the groupings to create clusters that better represent the data.

14. What Is Time Series Algorithm In Data Mining?

Ans:

A time series algorithm can be used to predict continuous values of data. Once the algorithm is trained to predict a series of data, it can predict the outcome of other series. The algorithm generates a model that can predict trends based only on the original dataset. New data can also be added and automatically becomes part of the trend analysis.
E.g. the performance of one employee can be used to forecast profit.

15. Explain Association Algorithm In Data Mining?

Ans:

The association algorithm is used for recommendation engines based on market basket analysis. Such an engine suggests products to customers based on what they bought earlier. The model is built on a dataset containing identifiers, both for individual cases and for the items that the cases contain. A group of items in a data set is called an itemset. The algorithm traverses the data set to find items that appear together in a case. The MINIMUM_SUPPORT parameter controls which associated items appear together frequently enough to be included in an itemset.
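Support counting with a minimum-support threshold can be sketched in Python; the transactions and threshold value below are a toy market-basket example:

```python
from itertools import combinations

# Hypothetical transactions; each set holds the items in one case.
transactions = [
    {"bread", "milk"}, {"bread", "butter"},
    {"bread", "milk", "butter"}, {"milk", "butter"},
]

MINIMUM_SUPPORT = 0.5  # an itemset must appear in at least half the cases

def frequent_itemsets(data, size, min_support):
    """Return itemsets of the given size whose support meets the threshold."""
    items = sorted(set().union(*data))
    result = {}
    for combo in combinations(items, size):
        count = sum(1 for t in data if set(combo) <= t)
        support = count / len(data)
        if support >= min_support:
            result[combo] = support
    return result

print(frequent_itemsets(transactions, 2, MINIMUM_SUPPORT))
```

Real association miners (e.g. Apriori-style algorithms) prune the candidate space instead of enumerating all combinations, but the support test is the same.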

16. What Is Sequence Clustering Algorithm?

Ans:

The sequence clustering algorithm groups similar or related paths: sequences of data containing events. The data represents a series of events or transitions between states in a dataset, such as a series of web clicks. The algorithm examines the probabilities of all transitions and measures the differences, or distances, between all the possible sequences in the data set. This helps it determine which sequences are the best input for clustering.
E.g. a sequence clustering algorithm may help find the best path for storing products of a similar nature in a retail warehouse.

17. Explain The Concepts And Capabilities Of Data Mining?

Ans:

Data mining is used to examine or explore data using queries. These queries can be fired on the data warehouse. Exploring the data in data mining helps in reporting, planning strategies, finding meaningful patterns, etc. It is most commonly used to transform large amounts of data into a meaningful form. Data here can be facts, numbers, or any real-time information like sales figures, cost, metadata, etc. Information would be the patterns and relationships amongst the data that provide insight.

18. Explain How To Work With The Data Mining Algorithms Included In Sql Server Data Mining?

Ans:

SQL Server offers Data Mining Add-ins for Office 2007 that allow discovering patterns and relationships in the data. This also helps in enhanced analysis. The add-in called Data Mining Client for Excel is used to prepare data and to build, evaluate, manage, and predict results from models.

19. Explain How To Use Dmx-the Data Mining Query Language.

Ans:

Data Mining Extensions (DMX) is based on the syntax of SQL. It is based on relational concepts and is mainly used to create and manage data mining models. DMX comprises two types of statements: data definition and data manipulation. Data definition statements are used to define or create new models and structures.
Example:
CREATE MINING STRUCTURE
CREATE MINING MODEL
Data manipulation statements are used to manage existing models and structures.
Example:
INSERT INTO
SELECT FROM .CONTENT (DMX)

20. Explain How To Mine An Olap Cube?

Ans:

Data Mining Extensions can be used to slice the data of the source cube in the order discovered by data mining. When a cube is mined, the case table is a dimension.




 

21. What Are The Different Ways Of Moving Data/databases Between Servers And Databases In Sql Server?

Ans:

There are several ways of doing this. One can use any of the following options:
– BACKUP/RESTORE
– Detaching/attaching databases
– Replication
– DTS
– BCP
– Log shipping
– INSERT…SELECT
– SELECT…INTO
– Creating INSERT scripts to generate data

22. What Are The Benefits Of User-defined Functions?

Ans:

a. They can be used in a number of places without the restrictions that apply to stored procedures.
b. Code can be made less complex and easier to write.
c. Parameters can be passed to the function.
d. They can be used in joins and also in SELECT, WHERE, or CASE statements.
e. They are simpler to invoke.

23. Define Pre Pruning?

Ans:

A tree is pruned by halting its construction early. Upon halting, the node becomes a leaf. The leaf may hold the most frequent class among the subset samples.

24. What Are Interval Scaled Variables?

Ans:

Interval-scaled variables are continuous measurements on a linear scale, for example height and weight, weather temperature, or coordinates of a cluster. Distances between these measurements can be calculated using the Euclidean distance or the Minkowski distance.
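The two distances mentioned can be computed with one formula, since Euclidean distance is the Minkowski distance with p = 2 (the height/weight values below are made up):

```python
def minkowski(a, b, p):
    """Minkowski distance between two interval-scaled points; p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

# Hypothetical (height_cm, weight_kg) measurements
p1, p2 = (170, 70), (173, 74)
print(minkowski(p1, p2, 2))  # Euclidean: 5.0
print(minkowski(p1, p2, 1))  # Manhattan (p=1): 7.0
```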

25. What Is A Sting?

Ans:

Statistical Information Grid is known as STING; it is a grid-based multi-resolution clustering method. In the STING method, all the objects are contained in rectangular cells; these cells are kept at various levels of resolution, and the levels are arranged in a hierarchical structure.

26. What Is A Dbscan?

Ans:

Density-Based Spatial Clustering of Applications with Noise is known as DBSCAN. DBSCAN is a density-based clustering method that converts regions of high object density into clusters of arbitrary shape and size. DBSCAN defines a cluster as a maximal set of density-connected points.

27. Define Density Based Method?

Ans:

The density-based method handles arbitrarily shaped clusters. In this method, clusters are formed in regions where the density of objects is high.

28. Define Chameleon Method?

Ans:

Chameleon is another hierarchical clustering method that uses dynamic modeling. Chameleon was introduced to overcome the drawbacks of the CURE method. In this method, two clusters are merged if the interconnectivity between the two clusters is greater than the interconnectivity between the objects within each cluster.

29. What Do You Mean By Partitioning Method?

Ans:

In the partitioning method, a partitioning algorithm arranges all the objects into various partitions, where the total number of partitions is less than the total number of objects. Each partition represents a cluster. The two main types of partitioning methods are k-means and k-medoids.
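A minimal 1-D k-means sketch illustrates the partitioning idea; the values and starting centers are invented:

```python
def kmeans_1d(values, centers, iterations=10):
    """Assign each value to its nearest center, then move each center to the
    mean of its partition; repeat for a fixed number of iterations."""
    for _ in range(iterations):
        partitions = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            partitions[nearest].append(v)
        centers = [sum(p) / len(p) if p else c
                   for p, c in zip(partitions, centers)]
    return centers, partitions

values = [1, 2, 3, 10, 11, 12]
centers, partitions = kmeans_1d(values, centers=[0.0, 5.0])
print(centers)  # [2.0, 11.0]
```

k-medoids works the same way except each cluster is represented by an actual member object (a medoid) rather than a mean, which makes it less sensitive to outliers.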

30. Define Genetic Algorithm?

Ans:

A genetic algorithm enables us to locate an optimal binary string by processing an initial random population of binary strings through operations such as artificial mutation, crossover, and selection.
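A toy sketch of those three operations, using a made-up objective (maximize the number of 1s in a 12-bit string); population size, rates, and generation count are arbitrary choices:

```python
import random

random.seed(42)  # deterministic run for this sketch

def fitness(bits):
    return sum(bits)  # toy objective: count of 1s in the string

def evolve(pop_size=20, length=12, generations=40):
    """Tiny GA: selection, one-point crossover, and bit-flip mutation."""
    pop = [[random.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[:pop_size // 2]              # selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, length)        # one-point crossover
            child = a[:cut] + b[cut:]
            if random.random() < 0.2:                # artificial mutation
                i = random.randrange(length)
                child[i] ^= 1
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

best = evolve()
print(best, fitness(best))
```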


 

31. What Is Ods?

Ans:

1. ODS means Operational Data Store.
2. An ODS is a collection of operational or base data that is extracted from operational databases and standardized, cleansed, consolidated, transformed, and loaded into an enterprise data architecture. An ODS is used to support data mining of operational data, or as the store for base data that is summarized for a data warehouse. The ODS may also be used to audit the data warehouse to ensure summarized and derived data is calculated properly. The ODS may further become the enterprise's shared operational database, allowing operational systems that are being reengineered to use the ODS as their operational database.

32. What Is Spatial Data Mining?

Ans:

Spatial data mining is the application of data mining methods to spatial data. Spatial data mining follows along the same functions in data mining, with the end objective to find patterns in geography. So far, data mining and Geographic Information Systems (GIS) have existed as two separate technologies, each with its own methods, traditions and approaches to visualization and data analysis. Particularly, most contemporary GIS have only very basic spatial analysis functionality. The immense explosion in geographically referenced data occasioned by developments in IT, digital mapping, remote sensing, and the global diffusion of GIS emphasises the importance of developing data driven inductive approaches to geographical analysis and modeling.
Data mining, which is the partially automated search for hidden patterns in large databases, offers great potential benefits for applied GIS-based decision-making. Recently, the task of integrating these two technologies has become critical, especially as various public and private sector organizations possessing huge databases with thematic and geographically referenced data begin to realise the huge potential of the information hidden there. Among those organizations are:
* offices requiring analysis or dissemination of geo-referenced statistical data
* public health services searching for explanations of disease clusters
* environmental agencies assessing the impact of changing land-use patterns on climate change
* geo-marketing companies doing customer segmentation based on spatial location.

33. What Is Smoothing?

Ans:

Smoothing is an approach that is used to remove the nonsystematic behaviors found in time series. It usually takes the form of finding moving averages of attribute values. It is used to filter out noise and outliers.
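The moving-average form of smoothing described above is a few lines of Python; the series values are made up to show a noisy upward trend:

```python
def moving_average(series, window):
    """Smooth a time series with a simple moving average to damp out noise."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

# A noisy upward trend (values invented for illustration)
series = [10, 13, 9, 14, 12, 17, 13, 18]
print(moving_average(series, 3))
```

Each output point averages `window` consecutive values, so isolated spikes and outliers have less influence on the smoothed series.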

34. What Are The Advantages Data Mining Over Traditional Approaches?

Ans:

Data mining is used for estimating the future. For example, if we take a company or business organization, using data mining we can predict the future of the business in terms of revenue, employees, customers, orders, etc.
Traditional approaches use simple algorithms for estimating the future, but they do not give results as accurate as data mining.

35. What Is Model Based Method?

Ans:

Model-based methods are used to optimize the fit between a given data set and a mathematical model. These methods assume that the data are generated by underlying probability distributions. There are two basic approaches in this method:
1. Statistical approach
2. Neural network approach

36. What Is An Index?

Ans:

Indexes in SQL Server are similar to the indexes in books: they help SQL Server retrieve data more quickly. Indexes are of two types, clustered and non-clustered. Rows in a table are stored in the order of the clustered index key,
so there can be only one clustered index per table.
Non-clustered indexes have their own storage, separate from the table data storage.
Non-clustered indexes are stored as B-tree structures,
with the leaf-level nodes holding the index key and its row locator.

37. Mention Some Of The Data Mining Techniques?

Ans:

Statistics
Machine learning
Decision trees
Hidden Markov models
Artificial intelligence
Genetic algorithms
Meta learning

38. Define Binary Variables? And What Are The Two Types Of Binary Variables?

Ans:

Binary variables have two states, 0 and 1: when the state is 0 the variable is absent, and when the state is 1 the variable is present. There are two types of binary variables, symmetric and asymmetric. Symmetric binary variables are those whose states have the same value and weight; asymmetric binary variables are those whose states do not.
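The distinction matters when computing similarity: for symmetric variables, 0-0 matches count (simple matching coefficient); for asymmetric variables, they are ignored (Jaccard coefficient). A sketch with made-up presence/absence vectors:

```python
def simple_matching(a, b):
    """Similarity for symmetric binary variables: 0-0 and 1-1 matches both count."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)

def jaccard(a, b):
    """Similarity for asymmetric binary variables: 0-0 matches are ignored."""
    both = sum(x == 1 and y == 1 for x, y in zip(a, b))
    either = sum(x == 1 or y == 1 for x, y in zip(a, b))
    return both / either

# Hypothetical presence/absence vectors
a = [1, 0, 1, 0, 0, 0]
b = [1, 0, 0, 0, 0, 1]
print(simple_matching(a, b))  # 4/6
print(jaccard(a, b))          # 1/3
```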

39. Explain The Issues Regarding Classification And Prediction?

Ans:

Preparing the data for classification and prediction:

  • Data cleaning
  • Relevance analysis
  • Data transformation

Criteria for comparing classification methods:

  • Predictive accuracy
  • Speed
  • Robustness
  • Scalability
  • Interpretability

40. What Are Non-additive Facts?

Ans:

Non-additive facts are facts that cannot be summed up over any of the dimensions present in the fact table.



 

41. What Is Meteorological Data?

Ans:

Meteorology is the interdisciplinary scientific study of the atmosphere. It observes changes in temperature, air pressure, moisture, and wind direction. Usually, temperature, pressure, wind, and humidity are the variables measured by a thermometer, barometer, anemometer, and hygrometer, respectively. There are many methods of collecting data; radar, lidar, and satellites are some of them.
Weather forecasts are made by collecting quantitative data about the current state of the atmosphere. The main issue in this prediction is that it involves high-dimensional data. To overcome this, it is necessary to first analyze and simplify the data before proceeding with other analysis. Some data mining techniques are appropriate in this context.

42. Define Descriptive Model?

Ans:

A descriptive model is used to determine the patterns and relationships in a sample of data. Data mining tasks that belong to the descriptive model:
Clustering
Summarization
Association rules
Sequence discovery

43. What Is A Star Schema?

Ans:

A star schema is a way of organizing tables such that results can be retrieved from the database easily and quickly in a warehouse environment. Usually, a star schema consists of one or more dimension tables around a fact table, which makes it look like a star; hence the name.

44. What Are The Steps Involved In Kdd Process?

Ans:

Data cleaning
Data integration
Data selection
Data transformation
Data mining
Pattern evaluation
Knowledge presentation

45. What Is A Lookup Table?

Ans:

A lookup table is used when updating a warehouse. When the lookup is placed on the target table (fact table / warehouse), based on the primary key of the target, it updates the table by allowing only new records or updated records that satisfy the lookup condition.

46. What Is Attribute Selection Measure?

Ans:

The information gain measure is used to select the test attribute at each node of the decision tree. Such a measure is referred to as an attribute selection measure, or a measure of the goodness of a split.
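Information gain is the entropy of the whole sample minus the weighted entropy after splitting on an attribute. A sketch with an invented (outlook, play?) sample:

```python
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attr_index, label_index=-1):
    """Entropy of the whole sample minus the weighted entropy of each subset
    produced by splitting on the attribute at attr_index."""
    labels = [r[label_index] for r in rows]
    split = {}
    for r in rows:
        split.setdefault(r[attr_index], []).append(r[label_index])
    remainder = sum(len(subset) / len(rows) * entropy(subset)
                    for subset in split.values())
    return entropy(labels) - remainder

# Hypothetical samples: (outlook, play?)
rows = [("sun", "yes"), ("sun", "yes"), ("rain", "no"), ("rain", "no")]
print(information_gain(rows, 0))  # 1.0: this split separates the classes perfectly
```

The decision-tree builder computes this gain for every candidate attribute at a node and tests the one with the highest gain.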

47. Explain Statistical Perspective In Data Mining?

Ans:

Point estimation
Data summarization
Bayesian techniques
Hypothesis testing
Regression
Correlation

48. Define Wave Cluster?

Ans:

It is a grid-based multi-resolution clustering method. In this method, all the objects are represented by a multidimensional grid structure, and a wavelet transformation is applied to find the dense regions. Each grid cell contains the information of the group of objects that map into that cell. A wavelet transformation is a signal-processing technique that decomposes a signal into various frequency sub-bands.

49. What Is Time Series Analysis?

Ans:

A time series is a set of attribute values over a period of time. Time Series Analysis may be viewed as finding patterns in the data and predicting future values.

50. Explain Mining Single-Dimensional Boolean Association Rules From Transactional Databases?

Ans:

There are two main approaches:
• The Apriori algorithm: finding frequent itemsets using candidate generation.
• Mining frequent itemsets without candidate generation (the FP-growth approach).




 

51. What Is Meta Learning?

Ans:

Meta learning is the concept of combining the predictions made by multiple data mining models and analyzing those predictions to formulate a new and previously unknown prediction.

52. Describe Important Index Characteristics?

Ans:

The characteristics of indexes are:
* They speed up searching for a row.
* They are sorted by key values.
* They are small, containing only a small number of the table's columns.
* They refer to the appropriate block of the table for a key value.

53. What Is The Use Of Regression?

Ans:

Regression can be used to solve classification problems, but it can also be used for applications such as forecasting. Regression can be performed using many different types of techniques; in essence, regression takes a set of data and fits the data to a formula.
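The simplest case, fitting a straight line by ordinary least squares and using it to forecast, can be sketched as follows (the month/sales numbers are invented):

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b (a minimal forecasting sketch)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    a = (sum(x * y for x, y in zip(xs, ys)) - n * mean_x * mean_y) / \
        (sum(x * x for x in xs) - n * mean_x ** 2)
    b = mean_y - a * mean_x
    return a, b

# Hypothetical data: month number vs. sales; forecast month 6
xs, ys = [1, 2, 3, 4], [10, 20, 30, 40]
a, b = fit_line(xs, ys)
print(a * 6 + b)  # 60.0
```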

54. What Is Dimensional Modelling? Why Is It Important ?

Ans:

Dimensional modelling is a design concept used by many data warehouse designers to build their data warehouse. In this design model all the data is stored in two types of tables: fact tables and dimension tables. A fact table contains the facts/measurements of the business, and a dimension table contains the context of the measurements, i.e. the dimensions on which the facts are calculated.

55. What Is Unique Index?

Ans:

A unique index is an index applied to a column whose values must be unique.
A unique index can also be applied to a group of columns.

56. What Are The Foundations Of Data Mining?

Ans:

Data mining techniques are the result of a long process of research and product development. This evolution began when business data was first stored on computers, continued with improvements in data access, and more recently, generated technologies that allow users to navigate through their data in real time. Data mining takes this evolutionary process beyond retrospective data access and navigation to prospective and proactive information delivery. Data mining is ready for application in the business community because it is supported by three technologies that are now sufficiently mature:
* Massive data collection
* Powerful multiprocessor computers
* Data mining algorithms
Commercial databases are growing at unprecedented rates. A recent META Group survey of data warehouse projects found that 19% of respondents are beyond the 50 gigabyte level, while 59% expect to be there by the second quarter of 1996. In some industries, such as retail, these numbers can be much larger. The accompanying need for improved computational engines can now be met in a cost-effective manner with parallel multiprocessor computer technology. Data mining algorithms embody techniques that have existed for at least 10 years, but have only recently been implemented as mature, reliable, understandable tools that consistently outperform older statistical methods.

57. What Is A Snowflake Schema?

Ans:

In a snowflake schema, each dimension has a primary dimension table, to which one or more additional dimension tables can join. The primary dimension table is the only table that can join to the fact table.

58. Differences Between Star And Snowflake Schemas?

Ans:

Star schema – all dimensions are linked directly to the fact table.
Snowflake schema – dimensions may be interlinked or may have one-to-many relationships with other tables.

59. What Is Hierarchical Method?

Ans:

Hierarchical method groups all the objects into a tree of clusters that are arranged in a hierarchical order. This method works on bottom-up or top-down approaches.

60. What Is Cure?

Ans:

Clustering Using Representatives is known as CURE. Clustering algorithms generally work on spherical, similarly sized clusters. CURE overcomes this limitation and is more robust with respect to outliers.


 

61. What Is Etl?

Ans:

ETL stands for extraction, transformation, and loading.
ETL tools provide developers with an interface for designing source-to-target mappings, transformations, and job control parameters.
*Extraction
Takes data from an external source and moves it to the warehouse pre-processor database.
*Transformation
The transform data task allows point-to-point generating, modifying, and transforming of data.
*Loading
The load data task adds records to a database table in the warehouse.

62. Define Rollup And Cube?

Ans:

Custom rollup operators provide a simple way of controlling the process of rolling a member up to its parent's value. The rollup uses the contents of the column as the custom rollup operator for each member, and this operator is used to evaluate the value of the member's parent.
If a cube has multiple custom rollup formulas and custom rollup members, the formulas are resolved in the order in which the dimensions were added to the cube.
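The general rollup idea, aggregating detail rows up to parent levels and a grand total, much like SQL's GROUP BY ROLLUP, can be sketched in Python with made-up fact rows:

```python
from collections import defaultdict

# Hypothetical fact rows: (region, product, sales)
facts = [("east", "pen", 5), ("east", "book", 7), ("west", "pen", 4)]

def rollup(rows):
    """Produce (region, product) detail totals, a subtotal per region,
    and a grand total; None marks a rolled-up level."""
    totals = defaultdict(int)
    for region, product, sales in rows:
        totals[(region, product)] += sales  # detail level
        totals[(region, None)] += sales     # rolled up to region
        totals[(None, None)] += sales       # grand total
    return dict(totals)

t = rollup(facts)
print(t[("east", None)], t[(None, None)])  # 12 16
```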

63. What Are The Different Problems That “data Mining” Can Solve?

Ans:

*Data mining helps analysts make faster business decisions, which increases revenue at lower cost.
*Data mining helps to understand, explore, and identify patterns in data.
*Data mining automates the process of finding predictive information in large databases.
*It helps to identify previously hidden patterns.

64. What Are Different Stages Of “data Mining”?

Ans:

Exploration: This stage involves the preparation and collection of data. It also involves data cleaning and transformation. Based on the size of the data, different tools to analyze it may be required. This stage helps to determine the different variables of the data and their behavior.
Model building and validation: This stage involves choosing the best model based on predictive performance. Each candidate model is applied to different data sets and compared for best performance. This stage is also called pattern identification. It is a little complex because it involves choosing the best pattern to allow easy predictions.
Deployment: The model selected in the previous stage is applied to new data sets to generate predictions or estimates of the expected outcome.

65. Explain How To Use Dmx-the Data Mining Query Language?

Ans:

Data Mining Extensions (DMX) is based on the syntax of SQL. It is based on relational concepts and is mainly used to create and manage data mining models. DMX comprises two types of statements: data definition and data manipulation. Data definition statements are used to define or create new models and structures.
Example:
CREATE MINING STRUCTURE
CREATE MINING MODEL
Data manipulation statements are used to manage existing models and structures.
Example:
INSERT INTO
SELECT FROM .CONTENT (DMX)

66. Explain how to mine an OLAP cube.

Ans:

Data Mining Extensions can be used to slice the data of the source cube in the order discovered by data mining. When a cube is mined, the case table is a dimension.

67. Explain what are the different storage models that are available in OLAP?

Ans:

The different storage models available in OLAP are as follows:
1. MOLAP – Multidimensional Online Analytical Processing
2. ROLAP – Relational Online Analytical Processing
3. HOLAP – Hybrid Online Analytical Processing
There are advantages and disadvantages with each of these storage models.

68. Explain in detail what is MOLAP? What are the advantages and disadvantages?

Ans:

>> As the name "MOLAP" depicts, it is multidimensional.
>> In this type of data storage, the data is stored in multidimensional cubes and not in standard relational databases.
The advantage of using MOLAP is:
The query performance is excellent because the data is stored in multidimensional cubes, and the calculations are pre-generated when the cube is created.
The disadvantages of using MOLAP are:
1. Only a limited amount of data can be stored. Since the calculations are triggered at cube-generation time, it cannot withstand huge amounts of data.
2. It needs a lot of skill to utilize.
3. It also has licensing costs associated with it.

69. Explain in detail what is ROLAP? What are the advantages and disadvantages?

Ans:

As the name suggests, the data is stored in relational databases.
The advantages of using ROLAP are:
1. As the data is stored in relational databases, it can handle huge amounts of data.
2. All the functionalities of a relational database are available.
The disadvantages of using ROLAP are:
1. It is comparatively slow.
2. All the limitations that apply to SQL apply to ROLAP too.

70. Explain in detail what is HOLAP? What are the advantages of using this type of data storage?

Ans:

HOLAP stands for Hybrid Online Analytical Processing.
It is a combination of MOLAP and ROLAP.
The advantages of using HOLAP are:
1. In this model, the cube is used to get summarized information.
2. For drill-down capabilities, it uses the ROLAP structure.



 

71. Explain the storage models of OLAP.

Ans:

MOLAP: Multidimensional Online Analytical Processing
In MOLAP, data is stored in the form of multidimensional cubes and not in relational databases.
Advantage
Excellent query performance as the cubes have all calculations pre-generated during creation of the cube.
Disadvantages
It can handle only a limited amount of data. Since all calculations have been pre-generated, the cube cannot be created from a large amount of data.
It requires huge investment as cube technology is proprietary and the knowledge base may not exist in the organization.
ROLAP Relational Online Analytical processing
The data is stored in relational databases.
Advantages
It can handle a large amount of data and
It provides all the functionalities of the relational database.
Disadvantages
It is slow.
The limitations of the SQL apply to the ROLAP too.
HOLAP Hybrid Online Analytical processing
HOLAP is a combination of the above two models. It combines the advantages in the following manner:
For summarized information it makes use of the cube.
For drill down operations, it uses ROLAP.

72. Differentiate between Data Mining and Data warehousing.

Ans:

Data warehousing is merely extracting data from different sources, cleaning the data, and storing it in the warehouse, whereas data mining aims to examine or explore that data using queries. These queries can be fired on the data warehouse. Exploring the data in data mining helps in reporting, planning strategies, finding meaningful patterns, etc.
E.g. a data warehouse of a company stores all the relevant information about projects and employees. Using data mining, one can use this data to generate different reports, such as profits generated.

73. What is Data purging?

Ans:

Deleting data from the data warehouse is known as data purging. While loading data into a staging or target table, a fresh data load may be needed every time. In this scenario, the staging or target table is purged prior to loading fresh data. Usually junk data, like rows with NULL values or spaces, is cleaned up; data purging is the process of cleaning this kind of junk.

74. What are CUBES?

Ans:

A data cube stores data in a summarized form, which helps in faster analysis of the data. The data is stored in such a way that it allows easy reporting.
E.g. using a data cube, a user may want to analyze the weekly and monthly performance of an employee. Here, month and week could be considered the dimensions of the cube.

75. Explain what are the different problems that “Data mining” can solve?

Ans:

Data mining can be used in a variety of fields and industries, like marketing, advertising of goods, products, and services, AI, and government intelligence.
The US Federal Bureau of Investigation uses data mining for security and intelligence screening, identifying illegal and incriminating e-information distributed over the internet.

76. What are OLAP and OLTP?

Ans:

OLTP: Online Transaction Processing helps and manages applications based on transactions involving high volumes of data. Typical examples of transactions are commonly observed in banks, air ticketing, etc. Because OLTP uses client-server architecture, it supports transactions running across a network.
OLAP: Online Analytical Processing performs analysis of business data and provides the ability to perform complex calculations on usually low volumes of data. OLAP helps the user gain insight into data coming from different sources (multidimensional).

77. What is Rollup and cube?

Ans:

Custom rollup operators provide a simple way of controlling the process of rolling up a member to its parent's value. The rollup uses the contents of the column as the custom rollup operator for each member, which is used to evaluate the value of the member's parent.
If a cube has multiple custom rollup formulas and custom rollup members, the formulas are resolved in the order in which the dimensions were added to the cube.

78. What are different stages of “Data mining”?

Ans:

Exploration: This stage involves the preparation and collection of data. It also involves data cleaning and transformation. Based on the size of the data, different tools to analyze it may be required. This stage helps to determine the different variables of the data and their behavior.
Model building and validation: This stage involves choosing the best model based on predictive performance. The model is applied to the different data sets and compared for best performance. This stage is also called pattern identification. It is a little complex because it involves choosing the best pattern to allow easy predictions.
Deployment: The model selected in the previous stage is applied to the data sets to generate predictions or estimates of the expected outcome.

79. Do you know how data mining and data warehousing work together?

Ans:

Data warehousing can be used for analyzing business needs by storing data in a meaningful form. Using data mining, one can forecast those business needs. The data warehouse can act as a source for this forecasting.

80. What is Discrete and Continuous data in Data mining world?

Ans:

Discrete data can be considered as defined or finite data, e.g. mobile numbers, gender. Continuous data can be considered as data which changes continuously and in an ordered fashion, e.g. age.

81. Can you explain what is MODEL in Data mining world?

Ans:

Models in Data mining help the different algorithms in decision making or pattern matching. The second stage of data mining involves considering various models and choosing the best one based on their predictive performance.

82. What is a Decision Tree Algorithm?

Ans:

A decision tree is a tree in which every node is either a leaf node or a decision node. The tree takes an object as input and outputs some decision. All paths from the root node to a leaf node are reached using AND, OR, or both. The tree is constructed using the regularities of the data. The decision tree is not affected by automatic data preparation.
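A tiny hand-written tree makes the leaf-node/decision-node distinction concrete. This is a sketch only: the insurance-style attributes and thresholds are invented, and a real tree would be learned from data rather than hard-coded:

```python
# Minimal illustration of a decision tree: decision nodes test an
# attribute, leaf nodes output the final decision. The rules and
# thresholds here are invented for the example.
def classify(applicant):
    if applicant["age"] < 25:        # decision node
        return "high risk"           # leaf node
    if applicant["accidents"] > 2:   # decision node
        return "high risk"           # leaf node
    return "low risk"                # leaf node

print(classify({"age": 22, "accidents": 0}))  # follows the first branch
print(classify({"age": 40, "accidents": 1}))  # reaches the default leaf
```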

83. What is Naive Bayes Algorithm?

Ans:

The Naïve Bayes algorithm is used to generate mining models. These models help to identify relationships between the input columns and the predictable columns. The algorithm can be used in the initial stage of exploration. It calculates the probability of every state of each input column given each possible state of the predictable column. After the model is built, the results can be used for exploration and for making predictions.
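The probability calculation described above can be sketched in a few lines. This is a bare counting version with a single input column and no smoothing, on invented weather/activity data, purely to show how prior and likelihood combine:

```python
from collections import Counter

# Bare-bones Naive Bayes sketch: estimate P(input | class) from counts,
# then score each class by prior * likelihood. Training data is invented.
train = [("sunny", "play"), ("sunny", "play"), ("rain", "stay"),
         ("rain", "stay"), ("sunny", "stay"), ("rain", "play")]

prior = Counter(label for _, label in train)  # class counts
joint = Counter(train)                        # (input, class) counts

def predict(weather):
    scores = {}
    for label, n in prior.items():
        likelihood = joint[(weather, label)] / n   # P(weather | label)
        scores[label] = (n / len(train)) * likelihood
    return max(scores, key=scores.get)

print(predict("sunny"))  # P(sunny|play)=2/3 beats P(sunny|stay)=1/3
```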

84. Explain clustering algorithm.

Ans:

A clustering algorithm is used to group sets of data with similar characteristics, also called clusters. These clusters help in making faster decisions and in exploring data. The algorithm first identifies relationships in a dataset, following which it generates a series of clusters based on those relationships. The process of creating clusters is iterative: the algorithm redefines the groupings to create clusters that better represent the data.
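The iterative redefinition can be sketched with 1-D k-means, one common clustering technique (the points and initial centers here are invented; real clustering works on multidimensional data):

```python
# Sketch of iterative clustering via 1-D k-means: assign each point to
# the nearest center, recompute the centers from the assignments, and
# repeat so the groupings better represent the data.
def kmeans_1d(points, centers, iters=10):
    for _ in range(iters):
        clusters = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(p - c))
            clusters[nearest].append(p)
        centers = [sum(ps) / len(ps) for ps in clusters.values() if ps]
    return sorted(centers)

print(kmeans_1d([1, 2, 3, 10, 11, 12], centers=[1, 12]))  # → [2.0, 11.0]
```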

85. What is Time Series algorithm in data mining?

Ans:

A time series algorithm can be used to predict continuous values of data. Once the algorithm is trained to predict a series of data, it can predict the outcome of other series. The algorithm generates a model that can predict trends based only on the original dataset. New data can also be added and automatically becomes part of the trend analysis.
E.g. the performance of one employee can influence or forecast the profit.
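As a deliberately simple stand-in for a real time-series model, a moving average shows the core idea of predicting the next value of a continuous series from its history (the profit figures are invented):

```python
# Illustrative trend prediction with a simple moving average. A real
# time-series algorithm is far more sophisticated, but the idea is the
# same: forecast the next value from the recent history of the series.
def forecast_next(series, window=3):
    recent = series[-window:]
    return sum(recent) / len(recent)

monthly_profit = [100, 110, 120, 130, 140]
print(forecast_next(monthly_profit))  # → 130.0 (mean of 120, 130, 140)
```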

86. Explain Association algorithm in Data mining.

Ans:

The association algorithm is used in recommendation engines that are based on market basket analysis. Such an engine suggests products to customers based on what they bought earlier. The model is built on a dataset containing identifiers, both for individual cases and for the items that the cases contain. A group of items in a data set is called an item set. The algorithm traverses the data set to find items that appear together in a case. The MINIMUM_SUPPORT parameter controls which associated items qualify for inclusion in an item set.
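The role of the minimum-support threshold can be sketched by counting item pairs across baskets and keeping only frequent ones (the baskets and the threshold value are invented; real implementations use Apriori-style pruning over larger item sets):

```python
from itertools import combinations
from collections import Counter

# Market-basket sketch: count how often each item pair co-occurs across
# baskets, then keep only pairs meeting a minimum support threshold,
# which is the role MINIMUM_SUPPORT plays in the association algorithm.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"beer", "chips"},
]

MIN_SUPPORT = 2  # a pair must appear in at least 2 baskets
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

frequent = {pair for pair, n in pair_counts.items() if n >= MIN_SUPPORT}
print(frequent)  # only the pairs with support >= 2 survive
```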

87. What is Sequence clustering algorithm?

Ans:

The sequence clustering algorithm collects similar or related paths, i.e. sequences of data containing events. The data represents a series of events or transitions between states in a dataset, like a series of web clicks. The algorithm examines all probabilities of transitions and measures the differences, or distances, between all the possible sequences in the data set. This helps it determine which sequences are the best input for clustering.
E.g. the sequence clustering algorithm may help find the best path for storing products of a similar nature in a retail warehouse.
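The transition-and-distance step can be sketched on invented web-click sequences (a real sequence clustering algorithm uses transition probabilities and a proper clustering step; this only shows how sequences are compared):

```python
from collections import Counter

# Sketch of the first step of sequence clustering: turn each sequence
# of events (e.g. web clicks) into transition counts, then measure the
# distance between sequences by comparing those counts.
def transitions(sequence):
    return Counter(zip(sequence, sequence[1:]))

def distance(seq_a, seq_b):
    ta, tb = transitions(seq_a), transitions(seq_b)
    keys = set(ta) | set(tb)
    return sum(abs(ta[k] - tb[k]) for k in keys)

a = ["home", "search", "product", "cart"]
b = ["home", "search", "product", "home"]
c = ["blog", "about"]
print(distance(a, b) < distance(a, c))  # a and b follow similar paths
```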

88. Explain the concepts and capabilities of data mining.

Ans:

Data mining is used to examine or explore the data using queries. These queries can be fired on the data warehouse. Exploring the data in data mining helps in reporting, planning strategies, finding meaningful patterns, etc. It is more commonly used to transform large amounts of data into a meaningful form. Data here can be facts, numbers, or any real-time information like sales figures, cost, metadata, etc. Information would be the patterns and relationships amongst the data that can provide insight.

89. Explain how to work with the data mining algorithms included in SQL Server data mining.

Ans:

SQL Server data mining offers Data Mining Add-ins for Office 2007 that allow discovering the patterns and relationships in the data and help in enhanced analysis. The add-in called Data Mining Client for Excel is used to first prepare data, then build, evaluate, manage, and predict results.

90. Explain how to use DMX-the data mining query language.

Ans:

Data Mining Extensions (DMX) is based on the syntax of SQL. It is based on relational concepts and is mainly used to create and manage data mining models. DMX comprises two types of statements: data definition and data manipulation. Data definition is used to define or create new models and structures.
Example:
CREATE MINING STRUCTURE
CREATE MINING MODEL
Data manipulation is used to manage the existing models and structures.
Example:
INSERT INTO
SELECT FROM .CONTENT (DMX)

91. What are foundations of data mining?

Ans:

Data mining is the result of a long process of research and product development. This evolution started when business data was first stored on computers, and it now allows users to navigate through their data in real time. Data mining is popular in the business community because it is supported by three technologies that are now mature: massive data collection, powerful multiprocessor computers, and data mining algorithms.

92. What is a scope of data mining?

Ans:

a. Automated prediction of trends and behaviors: Data mining automates the process of finding predictive information in large databases, so questions that traditionally required extensive hands-on analysis can now be answered directly from the data. Targeted marketing is a typical example: data mining can be applied to past promotional mailings to predict which prospects are most likely to respond.
b. Automated discovery of previously unknown patterns: Data mining tools sweep through databases and identify previously hidden patterns in one step. A good example of pattern discovery is the analysis of retail sales data to identify seemingly unrelated products that are often purchased together.

93. What are the advantages of data mining?

Ans:

Banks and financial institutions use data mining to find probable defaulters, based on past transactions, user behavior, and data patterns.
It helps advertisers push the right advertisements onto the web pages a surfer visits, based on machine learning algorithms. This way, data mining benefits both possible buyers and sellers of various products.
Retail malls and grocery stores use it to arrange and keep the most sellable items in the most attention-drawing positions.

94. What are the cons of data mining?

Ans:

Security: Users spend a lot of time online for various purposes, yet many systems do not have adequate security in place to protect their data. Some data mining analytics software is difficult to operate and requires users to have knowledge-based training. Also, data mining techniques are not 100% accurate, which may cause serious consequences in certain conditions.

95. Name the data mining techniques?

Ans:

a. Classification Analysis
b. Association Rule Learning
c. Anomaly or Outlier Detection
d. Clustering Analysis
e. Regression Analysis
f. Prediction
g. Sequential Patterns
h. Decision trees

96. Give a brief introduction to data mining process?

Ans:

Data mining is a relatively recent technology. It is a process of discovering hidden, valuable knowledge by analyzing large amounts of data stored in different databases. Because it is such an important process, it has become an advantage for various industries.

97. Name types of data mining?

Ans:

a. Data cleaning
b. Integration
c. Selection
d. Data transformation
e. Data mining
f. Pattern evaluation
g. Knowledge representation

98. Name the steps used in data mining?

Ans:

a. Business understanding
b. Data understanding
c. Data preparation
d. Modeling
e. Evaluation
f. Deployment

99. Name areas of applications of data mining?

Ans:

a. Data Mining Applications for Finance
b. Healthcare
c. Intelligence
d. Telecommunication
e. Energy
f. Retail
g. E-commerce
h. Supermarkets
i. Crime Agencies
j. Businesses Benefit from data mining

100. What are the required technological drivers in data mining?

Ans:

Database size: To maintain and process huge amounts of data, we need powerful systems.
Query complexity: To analyze complex queries, and a large number of them, we need an even more powerful system.

101. Give an introduction to data mining query language?

Ans:

DMQL was proposed by Han, Fu, Wang, et al. for the DBMiner data mining system. It is based on the Structured Query Language. These query languages are designed to support ad hoc and interactive data mining, and DMQL provides commands for specifying primitives. We can use DMQL to work with databases and data warehouses as well, and to define data mining tasks. In particular, we examine how to define data warehouses and data marts in DMQL.

102. What is Syntax for Task-Relevant Data Specification?

Ans:

Syntax of DMQL for specifying task-relevant data –
use database database_name
or
use data warehouse data_warehouse_name
in relevance to att_or_dim_list
from relation(s)/cube(s) [where condition]
order by order_list
group by grouping_list

103. What is Syntax for Specifying the Kind of Knowledge?

Ans:

Syntax for Characterization, Discrimination, Association, Classification, and Prediction.

104. Explain Syntax for Interestingness Measures Specification?

Ans:

Interestingness measures and thresholds can be specified by the user with the statement – with threshold = threshold_value

105. Explain Syntax for Pattern Presentation and Visualization Specification?

Ans:

Generally, there is a syntax which allows users to specify the display of discovered patterns in one or more forms, using the display as clause.

106. Explain Data Mining Languages Standardization?

Ans:

This will serve the following purposes –
Basically, it helps the systematic development of data mining solutions.
Also, improves interoperability among multiple data mining systems and functions.
Generally, it helps in promoting education and rapid learning.
Also, promotes the use of data mining systems in industry and society.

107. Explain useful data mining queries?

Ans:

First of all, it helps to apply the model to new data, to make single or multiple predictions. You can provide input values as parameters, or in a batch.
It can get a statistical summary of the data used for training, and extract patterns and rules, such as the typical case representing a pattern in the model.
It also helps in extracting regression formulas and other calculations that explain patterns.
You can get the cases that fit a particular pattern.
Further, it retrieves details about individual cases used in the model, including data not used in the analysis.
Moreover, it can retrain a model by adding new data, or perform cross-prediction.

108. Give a brief introduction to data mining knowledge discovery?

Ans:

Generally, most people don’t differentiate data mining from knowledge discovery. While others view data mining as an essential step in the process of knowledge discovery.

109. Explain steps involved in data mining knowledge process?

Ans:

Data Cleaning –
In this step, the noise and inconsistent data are removed.
Data Integration –
In this step, multiple data sources are combined.
Data Selection –
In this step, data relevant to the analysis task is retrieved from the database.
Data Transformation –
In this step, data is transformed into forms appropriate for mining, for example by performing summary or aggregation operations.
Data Mining –
In this step, intelligent methods are applied in order to extract data patterns.
Pattern Evaluation –
In this step, data patterns are evaluated.
Knowledge Presentation –
In this step, knowledge is represented to the user.
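The steps above can be sketched end-to-end on one tiny, invented dataset (each stage is reduced to a single line, purely to show how the stages feed each other):

```python
# Toy sketch of the knowledge discovery steps on invented sales rows:
# clean -> select -> transform (aggregate) -> mine -> present a finding.
raw = [
    {"region": "east", "sales": 100},
    {"region": "east", "sales": None},  # noisy record to be cleaned
    {"region": "west", "sales": 80},
    {"region": "east", "sales": 120},
]

cleaned = [r for r in raw if r["sales"] is not None]       # data cleaning
selected = [r for r in cleaned if r["region"] == "east"]   # data selection
total = sum(r["sales"] for r in selected)                  # transformation
pattern = "east is a strong region" if total > 200 else "no pattern"
print(pattern)  # knowledge presented to the user
```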

110. What are issues in data mining?

Ans:

A number of issues need to be addressed by any serious data mining package:
Uncertainty Handling
Dealing with Missing Values
Dealing with Noisy data
Efficiency of algorithms
Constraining Knowledge Discovered to only Useful
Incorporating Domain Knowledge
Size and Complexity of Data
Data Selection

111. What are major elements of data mining, explain?

Ans:

It helps extract, transform, and load transaction data onto the data warehouse system.
It stores and manages the data in a multidimensional database system.
It provides data access to business analysts and information technology professionals.
It analyzes the data with application software.
It presents the data in a useful format, such as a graph or table.

112. Name the different levels of analysis of data mining?

Ans:

a. Artificial Neural Networks
b. Genetic algorithms
c. Nearest neighbor method
d. Rule induction
e. Data visualization

113. Name the classification methods?

Ans:

a. Statistical Procedure Based Approach
b. Machine Learning Based Approach
c. Neural Network
d. Classification Algorithms
e. ID3 Algorithm
f. C4.5 Algorithm
g. K Nearest Neighbors Algorithm
h. Naïve Bayes Algorithm
i. SVM Algorithm
j. ANN Algorithm
k. 48 Decision Trees
l. Support Vector Machines
m. SenseClusters (an adaptation of the K-means clustering algorithm)

114. Explain Statistical Procedure Based Approach?

Ans:

Two main phases of work on classification can be identified within the statistical community.
The second, "modern" phase concentrated on more flexible classes of models, many of which attempt to provide an estimate of the joint distribution of the features within each class, which can in turn provide a classification rule.
Statistical procedures are generally characterized by having a precise underlying probability model, which provides a probability of belonging to each class rather than just a classification.
It is also assumed that the techniques will be used by statisticians, so some human involvement is assumed with regard to variable selection, transformation, and the overall structuring of the problem.

115. Explain Machine Learning Based Approach?

Ans:

It covers automatic computing procedures based on logical or binary operations, used to learn a task from a series of examples.
Here the focus is on decision-tree approaches, in which the classification result comes from a sequence of logical steps.
In principle, this allows us to deal with more general types of data, including cases where the number and type of attributes may vary.

116. Explain ID3 Algorithm?

Ans:

The ID3 algorithm starts with the original set as the root node. On each iteration, it iterates through every unused attribute of the set and computes the entropy of that attribute. It then chooses the attribute with the smallest entropy value.
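The entropy computation at the heart of that attribute choice can be sketched as follows (the weather/label rows are invented; a full ID3 would then recurse on each branch):

```python
from collections import Counter
from math import log2

# Entropy calculation used by ID3: for each attribute, compute the
# weighted entropy of the class labels after splitting on it; ID3 picks
# the attribute with the smallest value. The data rows are invented.
def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def split_entropy(rows, attr):
    groups = {}
    for row in rows:
        groups.setdefault(row[attr], []).append(row["label"])
    n = len(rows)
    return sum(len(g) / n * entropy(g) for g in groups.values())

rows = [
    {"outlook": "sunny", "windy": "no",  "label": "play"},
    {"outlook": "sunny", "windy": "yes", "label": "stay"},
    {"outlook": "rain",  "windy": "no",  "label": "play"},
    {"outlook": "rain",  "windy": "yes", "label": "stay"},
]
# outlook: 1.0, windy: 0.0 -> ID3 would split on windy first
print(split_entropy(rows, "outlook"), split_entropy(rows, "windy"))
```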

117. Name methods of clustering?

Ans:

They are classified into the following categories –
Partitioning Method
Hierarchical Method
Density-based Method
Grid-Based Method
Model-Based Method
Constraint-based Method

118. What do OLAP and OLTP stand for?

Ans:

Basically, OLAP is an acronym for Online Analytical Processing and OLTP is an acronym for Online Transactional Processing.

119. Define metadata?

Ans:

Basically, metadata is simply defined as data about data. In other words, we can say that metadata is the summarized data that leads us to the detailed data.

120. List the types of OLAP server?

Ans:

Basically, there are four types of OLAP servers, namely Relational OLAP, Multidimensional OLAP, Hybrid OLAP, and Specialized SQL Servers.

121. Explain the main difference between Data Mining and Data Warehousing?

Ans:

Data Warehousing:
It is a process where the data is extracted from various sources, then cleansed and stored.
Data Mining:
1. It is a process of exploring the data using queries.
2. Basically, the queries are used to explore a particular data set and examine the results. This helps the individual in reporting, strategy planning, and visualizing meaningful data sets.
The above can be explained by taking a simple example:
1. Let's take a software company where all of its project information is stored. This is nothing but data warehousing.
2. Accessing a particular project and identifying the profit and loss statement for that project can be considered data mining.

122. Explain in detail what does CUBE mean?

Ans:

A cube is nothing but a data storage structure where data can be stored in a way that makes it easier for the user to deal with reporting tasks. It helps expedite the data analysis process.
For example:
Let’s say the data related to an employee is stored in the form of cube. If you are evaluating the user performance based on weekly, monthly basis then week and month are considered to be the dimensions of the cube.

123. What are the different problems that “Data Mining” can solve in general?

Ans:

Data mining is a very important process which can be used to validate and screen data as it comes through, and the process can be refined based on the data mining results. By doing these activities, the existing process can be modified.
It is widely used in the following industries:
1. Marketing
2. Advertising
3. Services
4. Artificial Intelligence
5. Government intelligence
By following standard principles, a lot of illegal activities can be identified and dealt with, since as the internet evolved, a lot of loopholes evolved at the same time.

124. Explain the difference between OLAP and OLTP?

Ans:

OLTP:
1. OLTP stands for Online Transaction Processing.
2. It is useful in applications which involve a lot of transactions and high volumes of data. These types of applications are mainly observed in banking sectors, air ticketing, etc. The architecture used in OLTP is client-server architecture, and it supports transactions across a network as well.
OLAP:
1. OLAP stands for Online Analytical Processing.
2. It is widely used in applications where we need to support business data on which complex calculations happen. Most of the time, the data is in low volumes. As this is a multidimensional database, the user will have insight into how the data comes through from the various sources.

125. Explain the different stages of “Data Mining”?

Ans:

They are three different stages in Data Mining, they are as follows:
1. Exploration
2. Model building and validation
3. Deployment
Exploration is a stage where a lot of activities revolve around the preparation and collection of different data sets, so activities like cleaning and transformation are also included. Based on the data sets available, different tools may be necessary to analyze the data.
Model building and validation:
In this stage, the data sets are validated by applying different models, which are compared for best performance. This particular step is called pattern identification. It is a tedious process because the user has to identify which pattern is best suited for easy predictions.
Deployment:
Based on the previous step, the best pattern is applied to the data sets and used to generate predictions, helping to estimate expected outcomes.

126. Explain what is Discrete and continuous data concepts in Data Mining world?

Ans:

Discrete data can be classified as defined or finite data that has a meaning in itself, for example mobile numbers or gender.
Continuous data is data that changes continuously in an ordered fashion; an example of continuous data is age.

127. Explain what is MODEL in terms of Data Mining subject?

Ans:

A model is an important factor in data mining activities: it defines and helps the algorithms in making decisions and matching patterns. The second step is to evaluate the different models that are available and select the one best suited for validating the data sets.

128. Explain what is Naive Bayes Algorithm?

Ans:

The Naive Bayes algorithm is widely used to generate mining models. These models are generally used to identify the relationships between the input columns and the predictable columns. The algorithm is widely used during the initial stages of exploration.

129. Explain in detail about Clustering Algorithm?

Ans:

1. The clustering algorithm is used on groups of data sets that share common characteristics; these groups are called clusters.
2. Once the clusters are formed, they help to make faster decisions, and exploring the data is also faster.
3. First the algorithm identifies the relationships that exist in the dataset, and based on those it generates clusters. The process of creating clusters is also iterative.

130. Explain what is time series algorithm in data mining?

Ans:

1. This algorithm is a perfect fit for data whose values change continuously over time, for example age.
2. If the algorithm is trained and tuned to predict the data set, it will successfully keep track of the continuous data and predict the right values.
3. The algorithm generates a specific model which is capable of predicting future trends of the data based on the real, original data sets.
4. In between, new data can also be added as part of the trend analysis.

131. What is sequence clustering algorithm?

Ans:

As the name itself states, the data is collected at different points which occur as a sequence of events. The different data sets are analyzed based on the sequence in which the events occur, and then the best possible data input is determined for clustering.
Example:
A sequence clustering algorithm can help the organization identify a specific path for introducing a new product which has characteristics similar to existing ones in a retail warehouse.

132. What are the different concepts and capabilities of Data Mining?

Ans:

Data mining is primarily responsible for understanding and getting meaningful data from the data sets stored in the database.
Exploring the data in data mining is definitely helpful because it can be used in the following areas:
1. Reporting
2. Planning
3. Strategies
4. Meaningful Patterns etc.
A large amount of data is cleaned as per the requirement and can be transformed into a meaningful data which can be helpful for decision making at the executive level.
Data mining is really helpful with the following types of data:
1. Data sets which are in the form of sales figures
2. Forecast values for the business projection
3. Cost
4. Metadata etc
Based on the data analyzed, the information can be analyzed and appropriate relationships are defined.

133. What is the best way to work with data mining algorithms that are included in SQL Server data mining?

Ans:

SQL Server data mining offers an add-in for MS Office 2007. It helps to identify and discover relationships within the data, which is helpful later for enhanced analysis.
The add-in is called "Data Mining Client for Excel". With this, users are able to first prepare the data, then build, manage, and evaluate it, with the final output being predicted results.

134. How to use DMX- the data mining query language in detail?

Ans:

DMX consists of two types of statements in general: data definition and data manipulation.
Data definition:
This is used to define and create new models and structures.
Data manipulation:
As the name itself depicts, the data is manipulated based on the requirement.
The usage is explained in detail by picking up an example:
1. CREATE MINING STRUCTURE
2. CREATE MINING MODEL
3. Data manipulation is used on existing structures and models, with syntax such as:
INSERT INTO
SELECT FROM. CONTENT (DMX)

135. What are the different functions of data mining?

Ans:

The different functions of data mining are as follows:
1. Characterization
2. Association and correlation analysis
3. Classification
4. Prediction
5. Cluster analysis
6. Evolution analysis
7. Sequence analysis

136. Explain in detail what is data aggregation and Generalization?

Ans:

Data Aggregation:
As the name itself is self-explanatory, the data is aggregated together so that a cube can be constructed for data analysis purposes.
Generalization:
It is a process where low-level data is replaced by a high-level concept so the data can be generalized and made meaningful.
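Generalization can be sketched in a few lines: low-level values (exact ages) are replaced by higher-level concepts (age bands). The band names and boundaries here are invented for illustration:

```python
# Illustration of generalization: replace low-level values (exact ages)
# with higher-level concepts (age bands) so the data can be summarized.
def generalize_age(age):
    if age < 20:
        return "youth"
    if age < 60:
        return "adult"
    return "senior"

ages = [15, 34, 47, 71]
print([generalize_age(a) for a in ages])  # → ['youth', 'adult', 'adult', 'senior']
```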

137. Explain in detail In Learning and In Classification:

Ans:

In Learning:
This model is primarily used to analyze a particular training data set, using training data samples selected from a given population.
In Classification:
This model is primarily used to provide an estimate for a particular class by selecting test samples randomly. The term classification usually means identifying a known class for a specific piece of unknown data.

138. Explain in detail what is Cluster Analysis?

Ans:

Cluster analysis is an important human activity which is widely used in different applications. To be specific, this type of analysis is used in market research, pattern recognition, data analysis, and image processing.

139. Explain about data mining interface?

Ans:

The data mining interface is usually used for improving the quality of the queries that are used. It is nothing but the GUI front end for data mining activities.

140. Why is tuning a data warehouse needed? Explain in detail.

Ans:

The main aspect of a data warehouse is that the data evolves over time, and its behaviour is difficult to predict because of its ad hoc environment. Database tuning is much more difficult in an OLTP environment because of its ad hoc and real-time transaction loads. Due to this nature, data warehouse tuning is necessary, and it will change how the data is utilized based on need.

141. Explain in detail about association algorithm in Data mining?

Ans:

This algorithm is mainly used in recommendation engines for a specific market basket analysis.
The input for this algorithm is the products or items bought by a specific customer; based on those purchases, a recommendation engine predicts the most suitable products for the customer.

142. Explain in detail what is Data Purging?

Ans:

Data purging is an important step in maintaining appropriate data in the database.
Basically, deleting unnecessary data, or rows which have NULL values, from the database is nothing but data purging. So if there is a need to load fresh data into a database table, we need to perform a data purging activity first. This clears all unnecessary data from the database and helps in maintaining clean and meaningful data.
In short, data purging is a process where junk data that exists in the database gets cleared out.