Data formats, Data Quantity and Data Quality
In this article, I am going to discuss Data Formats, Data Quantity, and Data Quality in Data Science. Please read our previous article, where we discussed the Open Sources of Data.
The following are examples of research data formats, but they are not exhaustive.
- MS Word documents, .txt files, PDF, RTF, and XML (Extensible Markup Language) are all examples of text files.
- Multimedia – jpg / jpeg, gif, tiff, PNG, MPEG, mp4, QuickTime
- Numerical – SPSS, Stata, Excel
- 3D and statistical models
- Source code written in Java, C, or Python is an example of software.
- Formats tailored to a specific discipline – Astronomy’s Flexible Image Transport System (FITS), crystallography’s Crystallographic Information File (CIF).
- Olympus Confocal Microscope Data Format, Carl Zeiss Specimen Collections are examples of instrument-specific formats.
When thinking about data formats, another factor to consider is whether the format is proprietary or an open, community-supported standard. Some proprietary formats, such as .docx and .xlsx, are so widely used that they are likely to be around for a long time, which reduces the risk of format obsolescence.
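The categories above can be sketched as a small lookup table. This is a minimal illustration, not a complete registry: the `FORMATS` table and the `describe_format` helper are hypothetical names, only a handful of extensions are covered, and the open/proprietary flags are assumptions that follow this article's classification.

```python
from pathlib import Path

# Hypothetical lookup table: extension -> (category, is_open_standard).
# The flags are illustrative assumptions, not an official registry.
FORMATS = {
    ".txt": ("text", True),
    ".xml": ("text", True),
    ".docx": ("text", False),
    ".xlsx": ("numerical", False),
    ".sav": ("numerical", False),            # SPSS
    ".dta": ("numerical", False),            # Stata
    ".png": ("multimedia", True),
    ".fits": ("discipline-specific", True),  # astronomy
    ".cif": ("discipline-specific", True),   # crystallography
}


def describe_format(filename):
    """Return (category, is_open_standard) for a file name, or None if unknown."""
    suffix = Path(filename).suffix.lower()  # ".SAV" and ".sav" are treated alike
    return FORMATS.get(suffix)


if __name__ == "__main__":
    for name in ["notes.txt", "survey.SAV", "galaxy.fits", "scan.xyz"]:
        print(name, "->", describe_format(name))
```

A table like this could be used to flag files in a project that may need converting to an open format before long-term archiving.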
According to an often-cited estimate, the world generates roughly 2.5 quintillion bytes of data every day. This figure has been rising for years as a result of the hyper-connectivity brought about by digitalization, the Internet of Things, and social networks. Big Data ecosystems can capture, store, and manage these massive amounts of data, which is the foundation for analyzing the data and extracting its value. This is a gold mine for businesses that can use the data to improve processes, reduce costs, or maximize profits.
The various advanced analytics and Artificial Intelligence techniques aid in our understanding of business processes. They assist us in understanding what occurred (Descriptive Analytics), why it occurred (Diagnostic Analytics), what will occur in the future (Predictive Analytics), and which decision is the best among all possible ones (Prescriptive Analytics).
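To make three of these analytics types concrete, here is a small sketch on made-up monthly sales figures. The data, the `linear_forecast` and `best_stock_level` helpers, and the margin and holding-cost parameters are all assumptions for illustration. Diagnostic analytics is left out because it needs explanatory variables that this toy data set does not have.

```python
from statistics import mean

monthly_sales = [100, 110, 120, 130]  # made-up figures, one value per month

# Descriptive analytics: summarize what happened.
average_sales = mean(monthly_sales)


def linear_forecast(values):
    """Predictive analytics: fit a least-squares line and extrapolate one step."""
    xs = list(range(len(values)))
    x_bar, y_bar = mean(xs), mean(values)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar
    return intercept + slope * len(values)


def best_stock_level(forecast, options, unit_margin=5.0, unit_holding_cost=2.0):
    """Prescriptive analytics: pick the option with the highest expected profit."""

    def profit(stock):
        sold = min(stock, forecast)
        unsold = stock - sold
        return sold * unit_margin - unsold * unit_holding_cost

    return max(options, key=profit)


next_month = linear_forecast(monthly_sales)
recommended = best_stock_level(next_month, [120, 140, 160])

print("Average sales so far:", average_sales)
print("Forecast for next month:", next_month)
print("Recommended stock level:", recommended)
```

The prescriptive step depends entirely on the forecast, which is why errors in the underlying data propagate all the way to the final decision.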
However, the abundance of available information poses a challenge. By some estimates, almost 80% of the data generated is incorrect or incomplete, and thus of little use for business decision-making.
When using Artificial Intelligence techniques, data quality is critical because the results will be as good or bad as the quality of the data used.
Entering erroneous or biased data is dangerous. The algorithms behind Artificial Intelligence systems can only assume that the data to be analyzed is reliable. If the data is incorrect, the results will be deceptive, and the decision-making process will be jeopardized.
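One simple way to catch incomplete data before it reaches a model is a completeness check. The following is a minimal sketch, assuming records are plain dictionaries. The `completeness_report` helper and the sample records are hypothetical, and real projects would typically add checks for value ranges, duplicates, and bias as well.

```python
def completeness_report(records, required_fields):
    """Count records that have a non-empty value for every required field."""
    complete = 0
    for record in records:
        # A field counts as missing if it is absent, None, or an empty string.
        if all(record.get(field) not in (None, "") for field in required_fields):
            complete += 1
    total = len(records)
    return {
        "total": total,
        "complete": complete,
        "incomplete": total - complete,
        "complete_ratio": complete / total if total else 0.0,
    }


if __name__ == "__main__":
    # Made-up sample records, for illustration only.
    customers = [
        {"id": 1, "email": "a@example.com", "age": 34},
        {"id": 2, "email": "", "age": 51},
        {"id": 3, "email": "c@example.com", "age": None},
        {"id": 4, "email": "d@example.com", "age": 28},
    ]
    print(completeness_report(customers, ["id", "email", "age"]))
```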
More data generally leads to more reliable models and thus better results, but only if the data is real and representative. It is preferable to use a smaller amount of good data than a larger amount of poor-quality data. There are times, however, when the amount of quality data available is insufficient to train a model for the problem at hand, and therefore insufficient to provide a solution based on Data Analytics and Artificial Intelligence.
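The trade-off between quantity and quality can be illustrated with a small simulation. This is only a sketch under stated assumptions: the true value, the noise level, the 30% corruption rate, and the size of the bias are all invented for the example, and the helper names are hypothetical.

```python
import random
from statistics import mean

random.seed(42)  # fixed seed so the sketch is repeatable

TRUE_MEAN = 50.0  # assumed "real" value we are trying to estimate


def clean_sample(n):
    """Draw n representative measurements around the true mean."""
    return [random.gauss(TRUE_MEAN, 5.0) for _ in range(n)]


def corrupted_sample(n, bad_fraction=0.3, bias=20.0):
    """Draw n measurements where a fraction carries a systematic error."""
    values = []
    for _ in range(n):
        value = random.gauss(TRUE_MEAN, 5.0)
        if random.random() < bad_fraction:
            value += bias  # simulated faulty sensor or biased data entry
        values.append(value)
    return values


small_clean_error = abs(mean(clean_sample(100)) - TRUE_MEAN)
large_dirty_error = abs(mean(corrupted_sample(10_000)) - TRUE_MEAN)

print(f"100 clean records: error = {small_clean_error:.2f}")
print(f"10,000 corrupted records: error = {large_dirty_error:.2f}")
```

Under these assumptions, the estimate from 100 clean records lands much closer to the true value than the estimate from 10,000 partly corrupted records, because adding more records does not average away a systematic error.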
Another recurring issue is that, even if the data set to be analyzed is adequate to fully exploit Artificial Intelligence systems, there is always a tendency to collect additional data due to the low cost of storage and processing power.
The current trend of generating and storing large amounts of data does not appear to be changing in the near future. As a result, it is critical for businesses to establish a set of rules and procedures that define and regulate how data will be handled.
In the next article, I am going to discuss Data Transformation and Data Anonymization. In this article, I tried to explain Data Formats, Data Quantity, and Data Quality, and I hope you enjoyed it.