Using Big Data Analytics to produce high-quality Big Data Storage
By Andrei Khurshudov, Chief Technologist, Seagate, Mark Brewer, SVP and CIO, Seagate Technology, Michael Crump, VP of Quality, Seagate Technology
The preservation of human knowledge is of paramount importance to progress, now and in the future. And because the vast majority of new data is stored digitally, the need for reliable digital storage is greater than ever. The challenge today is ensuring that the drives mass-produced by the storage industry to keep up with the ever-growing need for data storage are manufactured to the highest standards of quality. The solution to that challenge may lie in a relatively new but fast-growing field known as Big Data Analytics.
The need for reliable data storage is particularly urgent because the amount of data stored every year is increasing rapidly. Indeed, much more data is generated than is actually stored. For example, CERN generates close to a petabyte of data every second as particles, fired around the Large Hadron Collider at velocities approaching the speed of light, are smashed together. But CERN can store only approximately 25 PB of this data every year, equivalent to about 8,333 full 3 TB hard disk drives.
When a disk drive is manufactured, it acts as an intelligent sensor that is aware of its own health and quality, and it stores its own sensor logs. Each drive is tested for many days, and during that time it can generate megabytes of test, diagnostic, and configuration data: as many as 1,000 variables logged per drive. In addition, information is collected about every important component going into each drive, how those components are combined, where and when each component and each drive was built, which firmware is used, which customer the drive ships to, and many other details.
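To make this concrete, the sketch below shows how per-drive test measurements might be joined with component genealogy records by serial number. All field names, serial numbers, and values here are hypothetical illustrations, not Seagate's actual schema:

```python
# Hypothetical per-drive test logs: serial number -> a tiny sample of the
# roughly 1,000 variables logged for each drive during testing.
test_logs = {
    "ZA1001": {"fly_height_nm": 1.2, "read_error_rate": 3e-10, "test_hours": 48},
    "ZA1002": {"fly_height_nm": 1.5, "read_error_rate": 9e-9,  "test_hours": 51},
}

# Hypothetical build genealogy: which components, site, and firmware
# went into each drive.
genealogy = {
    "ZA1001": {"head_lot": "H-17", "media_lot": "M-03", "site": "A", "firmware": "SN04"},
    "ZA1002": {"head_lot": "H-22", "media_lot": "M-03", "site": "B", "firmware": "SN04"},
}

def drive_record(serial):
    """Combine test measurements and build genealogy into one record."""
    return {"serial": serial, **test_logs[serial], **genealogy[serial]}

combined = [drive_record(s) for s in test_logs]
```

Keying every record on the drive's serial number is what later lets test results, component lots, and field behavior be analyzed together.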
The resulting combination of parameters, attributes, and measurements can produce hundreds of thousands of combinations and interdependencies. Analyzing these variables, alone and in combination, requires new methods, new tools, and new ideas to separate the key signals from the noise. So many variables affect drive quality, reliability, and performance that no traditional data analysis approach can easily handle the data generated and collected during the manufacturing process.
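One simple way to begin separating signal from noise is to screen each variable for its statistical association with a failure outcome. The sketch below ranks two hypothetical measurements by the absolute Pearson correlation with a fail flag; real screening would cover far more variables and more robust statistics:

```python
import math

# Hypothetical per-drive measurements and a binary fail flag.
drives = [
    {"fly_height_nm": 1.1, "servo_err": 0.02, "fail": 0},
    {"fly_height_nm": 1.2, "servo_err": 0.05, "fail": 0},
    {"fly_height_nm": 1.8, "servo_err": 0.03, "fail": 1},
    {"fly_height_nm": 1.9, "servo_err": 0.04, "fail": 1},
]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

fails = [d["fail"] for d in drives]
scores = {
    var: abs(pearson([d[var] for d in drives], fails))
    for var in ("fly_height_nm", "servo_err")
}
# Variables most strongly associated with failure come first.
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy data, fly height tracks the fail flag closely while servo error does not, so the former would be flagged for deeper investigation.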
How do we address this drive quality and reliability challenge? Through Big Data Analytics, which combines techniques such as advanced statistics and machine learning with large amounts of data to extract answers that are invisible to more traditional analytics operating on smaller data sets. With so much data available, Big Data Analytics can help control product quality and troubleshoot issues as quickly as possible.
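As a minimal illustration of the machine-learning side, the sketch below classifies a new drive as healthy or at risk by comparing its test measurements with previously labeled drives (a k-nearest-neighbors vote). The features, labels, and numbers are invented for illustration; production models would be far richer:

```python
import math

# Hypothetical labeled drives: (fly height in nm, servo error) -> outcome.
train = [
    ((1.1, 0.02), "healthy"),
    ((1.2, 0.03), "healthy"),
    ((1.8, 0.09), "at_risk"),
    ((1.9, 0.08), "at_risk"),
]

def classify(point, k=3):
    """Majority vote among the k training drives nearest to `point`."""
    nearest = sorted(train, key=lambda t: math.dist(point, t[0]))[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)
```

A drive whose measurements resemble known-good drives is labeled healthy; one resembling past failures is flagged before it ever ships.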
The first thing we need in order to implement Big Data Analytics that ensures magnetic hard drive reliability is a robust, coherent, end-to-end data collection process that captures everything that could be important and makes it available for further analysis. Robust means the data is available when it's needed and found where it's expected; coherent means all those pieces of data can be matched together as needed. A hard drive is subject to this process from the time and place where each main component is "born," through the drive factory and its assembly lines, through days of configuration and testing, to the customer who uses the drive to build computers or storage systems, to the end user, and all the way to the end of the drive's life.
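The coherence requirement above can be sketched as an event log in which every lifecycle stage writes records keyed by the same serial number, so a drive's full history can be reassembled on demand. Stage names and event details here are hypothetical:

```python
from collections import defaultdict

# Hypothetical events captured at each stage of one drive's life,
# all keyed by the same serial number.
events = [
    {"serial": "ZA1001", "stage": "component", "detail": "head lot H-17 bonded"},
    {"serial": "ZA1001", "stage": "assembly",  "detail": "built on line 4"},
    {"serial": "ZA1001", "stage": "test",      "detail": "48 h burn-in passed"},
    {"serial": "ZA1001", "stage": "field",     "detail": "deployed in OEM system"},
]

def lifecycle(serial, events):
    """Return one drive's history, grouped by lifecycle stage."""
    history = defaultdict(list)
    for e in events:
        if e["serial"] == serial:
            history[e["stage"]].append(e["detail"])
    return dict(history)
```

Because every stage writes against the same key, a field failure can be traced back through testing and assembly to the component lots involved.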
Second, we need a storage infrastructure and ecosystem that lend themselves to Big Data Analytics and complex data mining. That means a more traditional Enterprise Data Warehouse architecture running relational databases should be complemented by (and linked to) solutions designed for distributed analytics and parallel computing: a modern ecosystem with Hadoop/Spark capabilities, NoSQL databases (such as MongoDB and Cassandra), and the ability to store all possible data, both structured and unstructured, and access it in parallel for better performance.
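The parallel-access pattern behind Hadoop- and Spark-style analytics can be illustrated in miniature: each data partition is aggregated independently (the map step), and the partial results are then merged (the reduce step). This toy version runs sequentially in plain Python with invented data; real systems distribute the same pattern across many machines:

```python
from functools import reduce

# Hypothetical drive records split into two partitions, as they might be
# across the nodes of a cluster.
partitions = [
    [{"site": "A", "fail": 0}, {"site": "A", "fail": 1}],
    [{"site": "B", "fail": 0}, {"site": "B", "fail": 0}],
]

def map_partition(rows):
    """Aggregate one partition: site -> (drives tested, drives failed)."""
    out = {}
    for r in rows:
        tested, failed = out.get(r["site"], (0, 0))
        out[r["site"]] = (tested + 1, failed + r["fail"])
    return out

def merge(a, b):
    """Reduce step: combine two partial per-site tallies."""
    for site, (tested, failed) in b.items():
        t0, f0 = a.get(site, (0, 0))
        a[site] = (t0 + tested, f0 + failed)
    return a

fail_counts = reduce(merge, map(map_partition, partitions), {})
```

Because each partition is processed independently, the map step scales out with the number of nodes, which is what makes analysis over billions of drive-test records tractable.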
Third, we need trained personnel who can apply Big Data Analytics algorithms and solutions: true Data Scientists capable of working with extremely large data sets using the most advanced machine-learning techniques, and of seamlessly linking together the best programming environments and languages, machine-learning libraries, and elements of a highly distributed storage and analytics ecosystem. Together they can understand the complex data generated through testing and guarantee the best possible product quality, reliability, and performance.
This is the approach that Seagate has implemented, and it has already resulted in a dramatic improvement in the quality of our products—which means more data can be preserved to retrieve and use in the future.
Modern challenges require modern approaches. Making highly reliable devices to store all of the data generated in today's world, and mass-producing those devices by the tens of millions per quarter, is impossible without Big Data Analytics and Machine Learning. These technologies are now a requirement for any leading high-volume technology company in the 21st century.