The role of data processing in data mining
The concept of data processing
Data processing is the conversion of data into a usable, desired format. The conversion follows a defined sequence of operations and can be carried out automatically or manually. Nowadays most data is processed with the help of computers. Data can therefore be converted into different forms, including graphic and audio ones, depending on the software and the data processing methods used.
Data processing is a key step of the data mining process. Analyzing raw data directly is more complicated, and the results can be misleading. It is therefore better to process data before analysis.
Stages of data processing
Whichever way data is processed, the work starts with data collection. The full sequence of stages is:
- Data collection
- Data storage
- Data sorting
- Data processing
- Data analysis
- Data reporting
Once data is gathered, it needs to be stored. Data can be stored in physical form as paper documents, on laptop and desktop computers, or on any other data storage device. With the rise and rapid development of data mining and big data, data collection has become more complicated and time-consuming, and a thorough data analysis requires many operations.
At present, data is mostly stored in digital form. This allows it to be processed faster and converted into different formats, so the user can choose the most suitable output.
The stage that follows data storage is sorting and filtering. At this stage, the format of the stored data plays a crucial role and depends on the software used. Simple data can be stored as text files, tables, or a mixture of both. If the data is complex and requires special handling, dedicated data processing tools are used to perform the more challenging tasks.
Data manipulation can be carried out with a single software tool or, if the data requires more detailed analysis, with a set of tools.
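As a rough illustration of sorting and filtering simple tabular data, the sketch below reads a small CSV text, drops rows with a missing or too-small amount, and sorts the rest. The sample data, the column names and the function name `sort_and_filter` are assumptions made up for this example.

```python
import csv
import io

# Hypothetical sample data; the second row has a missing amount.
RAW = """name,amount
Ann,120
Bob,
Cy,75
Dee,300
"""

def sort_and_filter(text, min_amount):
    """Keep rows whose amount is present and >= min_amount, largest first."""
    rows = csv.DictReader(io.StringIO(text))
    kept = [
        {"name": r["name"], "amount": float(r["amount"])}
        for r in rows
        if r["amount"] and float(r["amount"]) >= min_amount
    ]
    return sorted(kept, key=lambda r: r["amount"], reverse=True)

result = sort_and_filter(RAW, 100)
print([r["name"] for r in result])
```

With a threshold of 100, only the rows for Dee and Ann remain, in that order.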
Technologies of data processing
There are three data processing technologies:
- Manual data processing
It involves processing data by hand only. No additional tools or devices are used, and all data manipulations are carried out manually.
- Mechanical data processing
It is data processing that relies on a mechanical device to work with data. Simple devices such as calculators or typewriters can also be used in this case. This method is suitable for simple operations on data.
- Electronic data processing
This is the most advanced technology. It is carried out by means of computers. It makes it possible to process growing amounts of data and provides more accurate results.
Data pre-processing methods
The results of data mining depend on the quality of the source data. To obtain data of good quality, the source data must be preprocessed, which improves the efficiency of data mining and makes it easier. Data preprocessing is the preparation and conversion of the original data.
The main data preprocessing methods are listed below:
- Data Cleaning
- Data Integration
- Data Transformation
- Data Reduction
Data is not always complete. It can lack attribute values or contain only aggregate ones. Moreover, data can be noisy, duplicated and inconsistent, and human or computer errors may occur at data entry. All of this negatively affects the data mining process. Data cleaning is applied to address these problems: it imputes missing values, removes outliers and reconciles inconsistencies. Thanks to data cleaning, the final results are more robust.
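The cleaning steps described above can be sketched as follows. This is a minimal illustration, not a complete cleaning pipeline: the sample records, the field names, and the plausibility range of 0 to 120 used to detect outlying ages are all assumptions chosen for the example, and the median is only one of several possible imputation strategies.

```python
from statistics import median

# Hypothetical plausibility bounds used to spot data entry errors.
MIN_AGE, MAX_AGE = 0, 120

def clean(records):
    """Deduplicate, reconcile city spelling, drop outliers, impute ages."""
    seen_ids = set()
    rows = []
    for rec in records:
        if rec["id"] in seen_ids:  # remove duplicated records
            continue
        seen_ids.add(rec["id"])
        row = dict(rec)
        # Reconcile inconsistent spellings such as "paris " and "PARIS".
        row["city"] = row["city"].strip().title()
        rows.append(row)
    # Remove outliers: ages outside the plausible range.
    rows = [r for r in rows
            if r["age"] is None or MIN_AGE <= r["age"] <= MAX_AGE]
    # Impute missing ages with the median of the known ages.
    fill = median(r["age"] for r in rows if r["age"] is not None)
    for r in rows:
        if r["age"] is None:
            r["age"] = fill
    return rows

records = [
    {"id": 1, "age": 34, "city": "Paris"},
    {"id": 2, "age": None, "city": "paris "},
    {"id": 2, "age": None, "city": "paris "},  # duplicate
    {"id": 3, "age": 29, "city": "PARIS"},
    {"id": 4, "age": 420, "city": "Lyon"},     # entry error
    {"id": 5, "age": 41, "city": "Lyon"},
]
cleaned = clean(records)
```

Here the duplicate of record 2 and the implausible record 4 are dropped, the missing age is filled with the median of the remaining known ages, and the three spellings of "Paris" are unified.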
Data from different sources can be combined into a single data store. Such sources include various databases, data cubes, and unstructured files. Data integration is used to bring data from multiple sources into a common structure. It relies on metadata (also called data about the data), which helps to avoid errors when the data is integrated.
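One way to picture metadata-driven integration is a field mapping that translates each source's column names into a shared schema before the records are merged. The sketch below assumes two hypothetical sources, "crm" and "billing", with invented field names; real integration tools handle far more cases, such as conflicting values and type mismatches.

```python
# Hypothetical metadata: for each source, map its field names
# to the names used in the shared target schema.
FIELD_MAP = {
    "crm": {"customer_id": "id", "full_name": "name"},
    "billing": {"cust": "id", "total_eur": "total"},
}

def to_common(source, row):
    """Rename the fields of one row according to the source's metadata."""
    mapping = FIELD_MAP[source]
    return {mapping[k]: v for k, v in row.items() if k in mapping}

def integrate(sources):
    """Merge rows from all sources into one record per customer id."""
    merged = {}
    for source, rows in sources.items():
        for row in rows:
            rec = to_common(source, row)
            merged.setdefault(rec["id"], {}).update(rec)
    return merged

sources = {
    "crm": [
        {"customer_id": 1, "full_name": "Ann"},
        {"customer_id": 2, "full_name": "Bob"},
    ],
    "billing": [{"cust": 1, "total_eur": 120.0}],
}
integrated = integrate(sources)
```

After integration, customer 1 has a single record with fields from both sources, while customer 2 only has the fields that the "crm" source provided.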
Data transformation is another important step toward final data of good quality. It converts data into forms that are suitable for data mining.
Data transformation includes normalization, smoothing of noisy data, aggregation and generalization. Most often, it is carried out through a combination of manual and automated processing.
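Two of the transformations named above can be sketched in a few lines. The example assumes min-max scaling as the normalization method and a simple moving average as the smoothing method; both are common choices, but the text does not prescribe them, and the function names are made up for illustration.

```python
def min_max(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # all values equal: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def moving_average(values, window=3):
    """Smooth noise by averaging each run of `window` consecutive values."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

normalized = min_max([10, 20, 30, 40])
smoothed = moving_average([1, 2, 9, 4, 5], window=3)
print(smoothed)  # [4.0, 5.0, 6.0]
```

In the smoothed series the spike of 9 is spread over its neighbours, at the cost of a shorter output (one value per full window).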
Data mining can be very time-consuming, especially when huge volumes of data have to be analyzed. In such cases, analyzing the full dataset can be impractical. That is where data reduction is used. It makes it possible to analyze a reduced representation of the data without compromising the integrity of the source data and while retaining good-quality information.
At the same time, data mining on the reduced volume of data should be more efficient, and the outcomes should be of the same, or nearly the same, quality as if the whole dataset had been analyzed. Data reduction involves the following strategies:
- Data cube aggregation
- Dimension reduction
- Data compression
- Numerosity reduction
- Discretization and concept hierarchy generation
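Two of these strategies can be illustrated on small, made-up inputs. The sketch below assumes quarterly sales rolled up to yearly totals as an example of aggregation, and equal-width binning as an example of discretization; the data and function names are hypothetical.

```python
from collections import defaultdict

def aggregate_by_year(sales):
    """Roll quarterly figures up to one total per year."""
    totals = defaultdict(int)
    for year, _quarter, amount in sales:
        totals[year] += amount
    return dict(totals)

def equal_width_bins(values, bins):
    """Replace each value with the index of its equal-width interval."""
    lo, hi = min(values), max(values)
    if hi == lo:  # all values equal: everything falls into bin 0
        return [0 for _ in values]
    width = (hi - lo) / bins
    # The maximum value would land in bin `bins`, so clamp it to the last bin.
    return [min(int((v - lo) / width), bins - 1) for v in values]

sales = [
    ("2023", "Q1", 100), ("2023", "Q2", 150),
    ("2024", "Q1", 120), ("2024", "Q2", 80),
]
yearly = aggregate_by_year(sales)
age_bins = equal_width_bins([22, 35, 47, 58, 63], bins=3)
print(yearly)    # {'2023': 250, '2024': 200}
print(age_bins)  # [0, 0, 1, 2, 2]
```

In both cases the reduced form is smaller than the input (two totals instead of four rows, three bin labels instead of five distinct ages) while keeping the pattern that later analysis is likely to need.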
The importance of data processing in data mining
In today's world, data matters to researchers, institutions, commercial organizations, and individual users alike. Once data has been gathered, the question arises of how to store, sort, filter, analyze and present it. This is where data mining comes into play. As data is often imperfect, noisy, and inconsistent, it requires additional processing.
The complexity of this process depends on the scope of the data collection and on the complexity of the required results. How time-consuming it is depends on the steps that have to be performed on the collected data and on the type of output required. This becomes especially relevant when large amounts of data need to be processed, which is why data mining is so widely used today.