Why can concerns about data trust slow down the AI development flow?
The objective of this article is to discuss approaches to managing data trust and data lineage, two issues that affect the AI model development flow and the results of AI decisions and actions.
According to Robert D. Thomas’s book “The AI Ladder”, AI may be the greatest opportunity of our time, with the potential to add nearly $16 trillion to the global economy over the next decade. Applying AI might require creating data, or acquiring it from other sources, that is relevant to your business problem. Together with your existing data, it is used to train the AI models that will then understand and respond to data in the real world. Hence, data is the source of energy for AI, but its complexities slow down progress.
One of the challenges AI systems face today is trust. We can build an AI system to solve business domain problems, yet there is skepticism about AI systems and processes. For example, can we explain the result and decision logic of an AI model trained with Deep Learning algorithms? That is hard. What about the risk of bias introduced by the model training dataset? Can we ensure the fairness of the AI system’s decisions?
More importantly, a wrong result caused by bad training data is a red flag for broken data quality. When that happens, we wish the data came with a GPS-style routing map that could trace back to where things went wrong.
Data is crucial to every organization’s competitiveness, but building and maintaining a quality big data architecture that supports disparate data sources and scales up quickly using distributed processing is no small feat, whether on Cloud or On-Prem. So we need an expanded approach to help us control and manage, at scale, the quality of the data we use to build AI models, along with measurable metrics that let us say a dataset is good.
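To make "measurable metrics" a little more concrete, here is a minimal Python sketch of three simple scores: completeness, uniqueness, and freshness. The function name, the field names (`id`, `age`, `updated`), the sample rows, and the 30-day freshness window are all assumptions made up for illustration; a real project would pick metrics and thresholds that fit its own data.

```python
from datetime import date


def quality_metrics(rows, required_fields, key_field, updated_field, as_of, max_age_days):
    """Return simple quality scores between 0.0 and 1.0 for a list of dict rows."""
    total = len(rows)
    if total == 0:
        return {"completeness": 0.0, "uniqueness": 0.0, "freshness": 0.0}

    # Completeness: share of rows where every required field has a value.
    complete = sum(
        all(row.get(f) not in (None, "") for f in required_fields) for row in rows
    )
    # Uniqueness: distinct key values relative to the number of rows.
    unique_keys = len({row.get(key_field) for row in rows})
    # Freshness: share of rows updated within the allowed age window.
    fresh = sum((as_of - row[updated_field]).days <= max_age_days for row in rows)

    return {
        "completeness": complete / total,
        "uniqueness": unique_keys / total,
        "freshness": fresh / total,
    }


# Hypothetical sample data: one missing value, one duplicate key, two stale rows.
rows = [
    {"id": 1, "age": 34, "updated": date(2024, 1, 10)},
    {"id": 2, "age": None, "updated": date(2024, 1, 9)},
    {"id": 2, "age": 51, "updated": date(2023, 6, 1)},
    {"id": 4, "age": 29, "updated": date(2023, 12, 1)},
]

metrics = quality_metrics(
    rows,
    required_fields=["id", "age"],
    key_field="id",
    updated_field="updated",
    as_of=date(2024, 1, 10),
    max_age_days=30,
)
print(metrics)
```

Scores like these can be tracked per dataset over time, so a sudden drop becomes a signal to investigate before the data reaches model training.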
What is Data Lineage?
In simple terms, Data Lineage is a GPS routing map for data: the journey data takes from its creation through its transformations over time. It describes data origins, movements, characteristics, and quality. Data Lineage should be able to answer questions such as the following:
- What processes create, update, and delete a given data element, and who operates and owns these processes?
- What other data elements have been used to derive a given data element, and what is the derivation logic?
- What controls are in place along the data lineage to manage the quality of a given data element?
- How complete is the data lineage repository for an organizational unit, and where is it stored?
- Why is record X missing from today’s processing?
- What known issues relate to a given data element at any point in its lineage?
- Is the dataset still up-to-date?
- etc.
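Several of the questions above can be answered once lineage is stored as a graph of data elements. The following is a minimal Python sketch under that assumption; the `LineageNode` fields, the element names, and the sample issue are hypothetical and are not taken from any particular lineage tool.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class LineageNode:
    name: str
    owner: str                      # who owns the producing process
    process: str                    # process that creates or updates the element
    derived_from: list = field(default_factory=list)   # names of upstream elements
    derivation: str = ""            # human-readable derivation logic
    known_issues: list = field(default_factory=list)


def trace_upstream(catalog, name):
    """Return all upstream elements of `name`, nearest first (breadth-first)."""
    seen = {name}
    order = []
    queue = deque(catalog[name].derived_from)
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        order.append(current)
        queue.extend(catalog[current].derived_from)
    return order


def upstream_issues(catalog, name):
    """Collect known issues recorded anywhere upstream of `name`."""
    return {
        n: catalog[n].known_issues
        for n in trace_upstream(catalog, name)
        if catalog[n].known_issues
    }


# Hypothetical catalog for a small model-training pipeline.
catalog = {
    node.name: node
    for node in [
        LineageNode("raw_visits", "ops-team", "nightly_export",
                    known_issues=["duplicate rows before 2020"]),
        LineageNode("raw_patients", "ops-team", "nightly_export"),
        LineageNode("clean_visits", "data-eng", "dedupe_job",
                    derived_from=["raw_visits"], derivation="drop duplicate visit ids"),
        LineageNode("training_set", "ml-team", "feature_build",
                    derived_from=["clean_visits", "raw_patients"],
                    derivation="join on patient id"),
    ]
}

print(trace_upstream(catalog, "training_set"))
print(upstream_issues(catalog, "training_set"))
```

With this structure, "what was this element derived from?" is a graph traversal, and "what known issues exist along its lineage?" is a lookup over the traversal result.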
Why keep track of the Data Lineage?
Businesses today are under pressure to reliably demonstrate the origin and transformation of data throughout the organization. Challenges to Data Quality include the movement, transformation, interpretation, and selection of data by people and processes.
Data Lineage has many benefits, including support for Data Governance, Compliance, Data Quality, and Business Impact Analysis. It helps you understand relationships and document the where and how of your data. In short, it can help businesses make better decisions and respond more rapidly to business opportunities and regulations.
Technology projects have long used traditional approaches to Data Lineage. For example, during the creation of a new Clinician/Patient system, a project would create a map of tables and joins to guide the summarization or grouping of data in reports. As AI projects grow, applying only the traditional approach to Data Lineage runs into roadblocks, especially when managing master data: the information about people, processes, and things that forms the business core.
Open Source Data Lineage Tools for the Cloud and On-Prem
Fortunately, the open-source world has started tackling this issue. Many tools available today help address some of the data quality concerns and help build an end-to-end Data Lineage solution at an affordable price: free.
I think we can roughly divide the open-source tools into four categories, based on the area of concern each tool focuses on.
Areas of Concern
1. Data Collection, Catalog, Metadata, Business Definition, and Governance
2. Build, Run, and Manage
3. Watch, Validate and Manage
4. Discovery and Consume
Open Source Tools for Each Area of Concern
1. Apache Atlas, Egeria: help build a clear picture of the organization’s definitions, data assets, models, and processes that is visible to everyone.
2. Marquez (for Apache Flink), Spline (for Apache Spark), Apache NiFi: help reduce operational complexity and time spent on troubleshooting.
3. Data Lineage DB and Data Lake: track changes and keep data lineage up-to-date.
4. DataHub (LinkedIn), Amundsen: search, change tracking, and quality monitoring.
The above is just a shortlist of the open-source tools available today, and I think you get the idea that it will grow over time. Even though we now have a set of great open-source data lineage tools, most of them focus on only one area of concern. It would be nice to have these tools work together as one integrated, end-to-end lineage solution. We will not do a deep dive into each product in this article; in the future, I will investigate how to make these tools work together.
Commercial Tools
How do commercial companies tackle the trust issue? IBM, for example, offers Watson OpenScale to track and measure outcomes from AI across its lifecycle, and to adapt and govern AI as business situations change. Watson OpenScale identifies and automatically mitigates harmful biases. It also runs a sophisticated set of diagnostic services to assess the accuracy of the model, with built-in state-of-the-art anomaly and bias detection capabilities.
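To give a feel for what a bias check measures, here is a small Python sketch of one commonly used fairness metric, the disparate impact ratio. This is a generic illustration and not Watson OpenScale's API; the function name, the group labels, and the sample outcomes are assumptions made up for this example.

```python
def disparate_impact(outcomes, groups, monitored, reference, favorable=1):
    """Ratio of favorable-outcome rates: monitored group divided by reference group."""

    def favorable_rate(group):
        selected = [o for o, g in zip(outcomes, groups) if g == group]
        if not selected:
            raise ValueError(f"no records for group {group!r}")
        return sum(o == favorable for o in selected) / len(selected)

    reference_rate = favorable_rate(reference)
    if reference_rate == 0:
        raise ValueError("reference group has no favorable outcomes")
    return favorable_rate(monitored) / reference_rate


# Hypothetical model decisions (1 = approved) for two groups of applicants.
outcomes = [1, 0, 0, 1, 1, 1, 1, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

ratio = disparate_impact(outcomes, groups, monitored="A", reference="B")
print(f"disparate impact ratio: {ratio:.2f}")
```

A ratio close to 1.0 means both groups receive favorable outcomes at similar rates. A frequently cited rule of thumb treats values below 0.8 as a signal worth investigating, although the right threshold depends on the use case.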
In summary, adding Data Lineage to the AI workflow can help ease nervousness about data trust, since it provides a visible routing map and evidence for traceability and troubleshooting. This makes it easier for data scientists, application developers, IT and AI operations teams, and business process owners to collaborate in building, running, and managing production AI faster with quality data.
I hope you find this short article useful and meaningful.
Reference:
IBM complimentary book: The AI Ladder https://www.ibm.com/downloads/cas/O1VADKY2/?cm_sp=ThinkDigitalResources-_-DataandAI-_-AILadderDownload