Most of a data scientist’s day is not spent training machine learning models; it is spent finding and preparing the data.
Ab Initio makes it easy to catalog your data, cleanse it, and identify appropriate data subsets. Ab Initio software also simplifies organizing data into a big, wide record containing the inputs your machine learning models need. We help you add new data, understand it, and join datasets together into that wide record.
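Ab Initio assembles that record inside its own graphical development environment, so the snippet below is only a loose sketch of the idea in Python with pandas; the dataset names, key, and columns are invented for illustration. Building the wide record boils down to joining cleansed datasets on a shared key so each row carries every model input:

```python
import pandas as pd

# Hypothetical cleansed datasets, each keyed by customer_id (illustrative only).
profiles = pd.DataFrame({"customer_id": [1, 2], "age": [34, 51]})
balances = pd.DataFrame({"customer_id": [1, 2], "avg_balance": [2500.0, 13200.0]})
activity = pd.DataFrame({"customer_id": [1, 2], "logins_30d": [12, 3]})

# Join the datasets into one big, wide record: one row per customer,
# one column per input the model needs.
wide_record = (
    profiles
    .merge(balances, on="customer_id", how="left")
    .merge(activity, on="customer_id", how="left")
)
print(wide_record)
```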
Extracting undifferentiated data from a multi-petabyte data lake can be like trying to find a specific water droplet in an ocean.
One global bank didn’t so much have a Hadoop data lake as a data ocean. In theory, the information was available with a simple query; in reality, getting information out of that ocean was problematic.
Originally, the bank’s reporting tools went directly to the source systems to extract information. That worked when the bank was smaller and had far less data. Now it took an enormous data lake to manage the data, and the bank had migrated everything into it. However, the reporting tools could not handle the change: different tools expected the data in different formats (and couldn’t read Hadoop formats), and reports generated for specific customers required looking up data from multiple sources. Trying to work with petabytes of data manually was like drinking the ocean.
That’s where Ab Initio came in.
The bank was dealing with petabytes of undifferentiated data. Using Ab Initio software, analysts could quickly develop rules to locate and filter the data. For each reporting tool, those rules identified what the tool needed, pulled the data from the lake, and presented it as if it were coming from the source the tool expected. They could search the lake for that single drop of water and deliver it to the right place.
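Ab Initio expresses such rules as components in its own graphical environment, so the snippet below is not the bank’s implementation; it is a rough PySpark sketch of the same pattern, with hypothetical paths, column names, and filter values, showing a per-tool rule that filters the lake and reshapes the output to look like the source the tool expects:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("per_tool_feed").getOrCreate()

# Hypothetical lake table; the path and schema are placeholders.
transactions = spark.read.parquet("/datalake/raw/transactions")

# Rule 1: locate and filter just the records this reporting tool needs.
relevant = transactions.filter(
    (col("region") == "EMEA") & (col("booking_date") >= "2018-01-01")
)

# Rule 2: rename and select columns so the feed looks like the legacy
# source system the tool was originally built to read.
tool_feed = relevant.select(
    col("account_id").alias("ACCT_NO"),
    col("booking_date").alias("POST_DT"),
    col("amount").alias("TXN_AMT"),
)

# Deliver it in a format the tool can consume (CSV, not a Hadoop-native format).
tool_feed.write.mode("overwrite").csv("/feeds/legacy_reporting_tool", header=True)
```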
When data needed to be generated from multiple data elements in the lake, Ab Initio again came to the rescue. Analysts could develop additional rules to determine the shape the tables should have, do the appropriate joins on the undifferentiated data, populate the tables, and feed the resulting tables to the reporting tools that needed the data in exactly that format. Suddenly, that massive data lake was useful: Hadoop was living up to its hype, courtesy of Ab Initio.
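As with the filtering rules, these joins are built in Ab Initio’s own environment; purely as an illustrative sketch of the pattern (again with hypothetical dataset and column names), the table-shaping step might look like this in PySpark:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("report_table").getOrCreate()

# Hypothetical lake datasets; paths and join keys are placeholders.
customers = spark.read.parquet("/datalake/raw/customers")
accounts = spark.read.parquet("/datalake/raw/accounts")
balances = spark.read.parquet("/datalake/raw/balances")

# The rules define the target shape: one row per account, enriched with
# customer and balance detail drawn from across the lake.
report_table = (
    accounts
    .join(customers, on="customer_id", how="left")
    .join(balances, on="account_id", how="left")
    .select("account_id", "customer_name", "segment", "current_balance")
)

# Feed the populated table to the reporting tool in exactly the format it expects.
report_table.write.mode("overwrite").parquet("/feeds/customer_report")
```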
Sure, the bank could have met their immediate needs with traditional databases rather than with Hadoop. But planning for the long haul, they knew they would soon have to handle many petabytes of data, and the cost of traditional databases would swiftly become prohibitive, whereas they could keep expanding their Hadoop clusters for relatively little cost. By using Ab Initio to get their data lake working today, they were ensuring the success and profitability of their business tomorrow.
It takes an ocean-sized data lake to manage petabytes of data, but that lake is only useful when the data in it is easy to find and use. Thanks to Ab Initio, that’s exactly what this bank now has.