Abstract

The unparalleled volume of data being generated has heightened the need for approaches that can manage it and translate it into actionable insights. While contemporary data-driven and generative systems are popular for handling large volumes of changing and diverse data, they are not silver bullets because they inherently lack knowledge grounding. Knowledge-driven processes have emerged as compelling approaches that leverage external knowledge and structured representations to compensate for the shortcomings of data-driven systems. Such processes exploit data while also drawing on extensive knowledge in the form of Knowledge Graphs (KGs). In this tutorial, we will introduce knowledge-driven processes for big data management and applications and provide interactive, hands-on, lab-oriented sessions using real-world datasets in structured, semi-structured, and unstructured formats. Specifically, we will use the EMPWR [1] platform for creating and maintaining large KGs and demonstrate recent innovations in three concrete real-world use cases: (i) development of a pharmaceutical KG with over 6M triples, 1.5M nodes, and 3000 relation types; (ii) development of a suite of large-scale KGs with 10M+ triples, 2M+ entities, and 19 relations from real-world driving scenes and their use in machine perception tasks; and (iii) an AI pipeline recommender system with a KG consisting of 78M triples, 8M nodes, and 25M relations.

Why Attend?

The sheer volume, variety, and velocity of data generated in the current digital era have heightened the need for approaches that can effectively manage and make sense of this big data. While modern data-driven systems, including Large Language Models (LLMs), can learn complex patterns and relationships, their internal understanding of the world (i.e., the world model) can be brittle and is influenced by the data that they are trained on. As we enter what DARPA describes as the third phase of AI (i.e., Neurosymbolic AI [8]), knowledge-driven systems such as Knowledge Graphs (KGs) are increasingly used to provide notational efficacy and declarative capabilities to make implicit data explicit, enabling high-quality linguistic and situational knowledge for more explainable output. Recognizing the merits of both data-driven and knowledge-driven processes, we advocate an approach that hybridizes the wide variety of techniques from both ends of the spectrum for managing big data, with KGs as the infrastructure. Considering three application domains, we will guide the audience through demonstrations and hands-on sessions that apply this infrastructure to real-world tasks.

Detailed Descriptions

Session 1 - The Current State-of-the-art Knowledge-driven Processes with EMPWR

We first review the current state of KG lifecycle design practice and the associated challenges. We will discuss the guiding principles of system design for developing scalable KG infrastructure that manages data from diverse sources and formats (including representation and provenance), as well as system architectures and workflow components. We will then take a deep dive (with hands-on exercises) into the process of creating a KG with EMPWR [1], a comprehensive KG development and lifecycle support platform, drawing on experience from our work in pharmaceutical KG construction (in partnership with collaborator WIPRO) with over 6M triples, 1.5M nodes, and 3000 relation types, interconnecting knowledge from open and domain-specific sources. Subtopics: We will demonstrate different strategies and tools, including leveraging LLMs to bootstrap KG creation. We will also showcase the Common Metadata Framework (CMF) for metadata logging at each stage.
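To make the size figures above concrete, the following is a minimal sketch of how a KG can be held as subject-predicate-object triples with a provenance tag, and how triple, node, and relation-type counts are derived from them. It is not the EMPWR API: the `Triple` record, the `kg_stats` and `provenance` helpers, the source names, and the example facts are all assumptions made for illustration.

```python
from collections import defaultdict, namedtuple

# Hypothetical triple record: a fact plus the source it was extracted from.
Triple = namedtuple("Triple", ["subject", "predicate", "obj", "source"])


def kg_stats(triples):
    """Return (distinct triples, distinct nodes, distinct relation types)."""
    facts = {(t.subject, t.predicate, t.obj) for t in triples}
    nodes = {s for s, _, _ in facts} | {o for _, _, o in facts}
    relations = {p for _, p, _ in facts}
    return len(facts), len(nodes), len(relations)


def provenance(triples):
    """Map each distinct fact to the sorted list of sources asserting it."""
    sources = defaultdict(set)
    for t in triples:
        sources[(t.subject, t.predicate, t.obj)].add(t.source)
    return {fact: sorted(srcs) for fact, srcs in sources.items()}


# Illustrative facts only; source_a and source_b are placeholder names.
triples = [
    Triple("aspirin", "treats", "headache", "source_a"),
    Triple("aspirin", "interacts_with", "warfarin", "source_b"),
    Triple("warfarin", "treats", "thrombosis", "source_a"),
    Triple("aspirin", "treats", "headache", "source_b"),  # same fact, second source
]

n_triples, n_nodes, n_relations = kg_stats(triples)
print(n_triples, n_nodes, n_relations)  # 3 4 2
print(provenance(triples)[("aspirin", "treats", "headache")])  # ['source_a', 'source_b']
```

Keeping the source alongside each fact means a duplicate assertion from a second source adds provenance without inflating the triple count.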

LINK TO SLIDES

Session 2 - Leveraging KGs for Machine Perception Tasks in Autonomous Driving (AD)

Despite recent advancements, AD is still far from achieving full level-5 autonomy. The primary obstacle is navigating in an open-world environment, which requires making informed decisions within novel and complex situations. The associated challenges, such as machine perception (detecting and recognizing objects/events in a scene) and partial observability, are yet to be resolved. We will demonstrate how large-scale KGs for AD can be developed using two high-quality, open, real-world datasets and how they can be integrated with perception processes to overcome such challenges [2]–[4]. We will also showcase how external knowledge, such as commonsense and geospatial knowledge, can be integrated to enrich the domain KGs representing real-world scenes. Subtopics: We will demonstrate the application of these KGs on two high-level tasks in machine perception and reasoning: (i) knowledge-based entity prediction [4], which refers to the prediction of scene entities that may have gone unobserved and unlabeled in current datasets, and (ii) explainable scene clustering [5], which refers to clustering a given set of scenes and assigning descriptive label(s) to each cluster.
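As a toy illustration of the entity prediction task described above, the sketch below ranks entities that are absent from a partially observed scene by how often they co-occur with the observed ones in other scenes. The co-occurrence heuristic is a deliberately simple stand-in and is not the method of the cited work; the scene contents, entity names, and function names are assumptions.

```python
from collections import Counter
from itertools import permutations


def build_cooccurrence(scenes):
    """Count, for each ordered entity pair, the scenes in which both appear."""
    counts = Counter()
    for entities in scenes:
        for a, b in permutations(sorted(set(entities)), 2):
            counts[(a, b)] += 1
    return counts


def predict_missing(observed, counts, vocabulary):
    """Rank unobserved entities by total co-occurrence with the observed ones."""
    scores = {
        cand: sum(counts[(obs, cand)] for obs in observed)
        for cand in vocabulary - set(observed)
    }
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda c: (-scores[c], c))


# Hypothetical scenes, each reduced to the set of entity types it contains.
scenes = [
    {"car", "pedestrian", "crosswalk"},
    {"car", "traffic_light", "crosswalk", "pedestrian"},
    {"car", "truck", "highway"},
    {"pedestrian", "crosswalk", "stroller"},
]
vocabulary = set().union(*scenes)
counts = build_cooccurrence(scenes)

ranking = predict_missing({"crosswalk", "car"}, counts, vocabulary)
print(ranking[0])  # pedestrian
```

Here a scene showing only a car and a crosswalk yields "pedestrian" as the most likely unobserved entity, which is the kind of output the task definition asks for.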

LINK TO SLIDES

Session 3 - AI Pipeline Recommendations

With the plethora of new AI technologies, selecting and deploying suitable models for a given task and dataset can be challenging. Referencing parameters from relevant published pipelines can often act as the seed for experiments. However, this requires metadata of executed pipelines that captures the interactions and dependencies of each pipeline component. While open-source platforms such as Papers with Code, OpenML, Hugging Face, and Kaggle can be leveraged, integrating data from these diverse sources presents challenges such as differing nomenclature, data structure variations, and the lack of component semantics to understand context. We will introduce the AI Pipeline Metadata Knowledge Graph (AIMKG), formalized by the Common Metadata Ontology (CMO), which integrates and aggregates data from diverse sources and supports multimodal properties to provide context-aware recommendations. Subtopics: We will demonstrate a search and recommender system that showcases the potential of AIMKG, consisting of 1.6M pipelines and 78M triples with 8M nodes and 25M relationships, to recommend relevant pipelines for optimization while ensuring reproducibility, reusability, and explainability [6], [7].
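The nomenclature and structure mismatches mentioned above can be sketched as follows: records from two sources use different field names and task labels, and a small mapping layer brings them into one schema before they are emitted as triples. This is only an illustration under stated assumptions. The source names, field names, alias table, and `has_*` predicates are hypothetical and are not CMO terms or any platform's actual export format.

```python
# Hypothetical alias table: source-specific task labels -> one canonical label.
TASK_ALIASES = {
    "Image Classification": "image_classification",
    "image-classification": "image_classification",
    "img_cls": "image_classification",
}

# Hypothetical per-source field maps: source field name -> common field name.
FIELD_MAPS = {
    "source_a": {"task": "task", "model_name": "model", "data": "dataset"},
    "source_b": {"task_type": "task", "architecture": "model", "dataset_id": "dataset"},
}


def normalize(record, source):
    """Rename fields to the common schema and canonicalize the task label."""
    mapping = FIELD_MAPS[source]
    out = {mapping[k]: v for k, v in record.items() if k in mapping}
    out["task"] = TASK_ALIASES.get(out["task"], out["task"])
    return out


def to_triples(pipeline_id, record):
    """Emit (pipeline, has_<field>, value) triples for a normalized record."""
    return [(pipeline_id, f"has_{field}", value) for field, value in sorted(record.items())]


rec_a = normalize(
    {"task": "Image Classification", "model_name": "resnet50", "data": "cifar10"},
    "source_a",
)
rec_b = normalize(
    {"task_type": "img_cls", "architecture": "resnet50", "dataset_id": "cifar10"},
    "source_b",
)

print(rec_a == rec_b)  # True
print(to_triples("pipeline_1", rec_a))
```

Once both records share a schema and vocabulary, pipelines from different sources become comparable, which is a precondition for search and recommendation over the merged graph.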

LINK TO SLIDES

Targeted Audience and Prerequisites

Our intended audience is researchers, students, and practitioners interested in learning and applying knowledge-driven processes for big data management and its applications to real-world problems. The prerequisites include a basic understanding of deep learning and familiarity with popular data processing frameworks (e.g., NumPy, Scikit-learn, and Pandas), LLMs, and semantic technologies (e.g., RDF, OWL, linked open data). We expect that by the end of the tutorial, attendees will understand how knowledge-driven processes can be effectively leveraged as an infrastructure to manage big data and tackle core challenges in various domains and critical applications.

Tutorial Schedule

Dec 17, 2024 (All times are in EST timezone)

The tutorial runs for a total of 3 hours, with participant engagement in labs and exercises, as follows:

09:00 - 09:10 AM > An Overview of the KG Lifecycle And Introduction to Data And Knowledge-driven Processes (10 mins)
09:10 - 09:50 AM > Session 1: The Current State-of-the-art KG Development with EMPWR: A Case Study with Pharmaceutical KG (40 mins)
09:50 - 10:00 AM > Session Break
10:00 - 10:40 AM > Session 2: Leveraging KGs for Machine Perception Tasks in Autonomous Driving (AD) (40 mins)
10:40 - 10:50 AM > Session Break
10:50 - 11:30 AM > Session 3: AI Pipeline Recommendations (40 mins)
11:30 - 12:00 PM > Concluding Remarks & Q&A

Speakers

Joey Yip

AIISC, University of South Carolina

Joey Yip is a Ph.D. student at the AI Institute, University of South Carolina. His primary research involves intertwining machine and deep learning, knowledge graphs (Semantic Web), and generative AI for knowledge creation and natural language understanding. He has authored and co-authored publications in the JMIR journal, at the WWW conference, and in IEEE Internet Computing.

hyip@email.sc.edu

Ruwan Wickramarachchi

AIISC, University of South Carolina

Ruwan Wickramarachchi is a Ph.D. student at the AI Institute, University of South Carolina. His dissertation research focuses on introducing expressive knowledge representation and Neurosymbolic AI techniques to improve machine perception and context understanding in autonomous systems. He has co-organized several tutorials on Neurosymbolic AI, particularly at ISWC, U.S. Semantic Technologies Symposium (US2TS), and ACM Hypertext (HT).

ruwan@email.sc.edu

Revathy Venkataramanan

AIISC, University of South Carolina

Revathy Venkataramanan is a Ph.D. student at the AI Institute, University of South Carolina, and also an intern at Hewlett Packard Labs. Her research focuses on explainable recommendation systems, which aim to harness the generalization of deep learning models for pattern mining and knowledge graphs for reasoning and traceability. She works across diverse domains such as diet and nutrition, AI pipeline workflows, and smart manufacturing.

revathy@email.sc.edu

Amit Sheth

AIISC, University of South Carolina

Amit Sheth is the founding director of the AI Institute, University of South Carolina. His current core research includes knowledge-infused learning and explainability. He is a fellow of IEEE, AAAI, AAAS, and ACM. He has co-organized over 100 international events and tutorials. He has founded three companies by licensing his university research outcomes, including the first Semantic Web company in 1999, which pioneered technology similar to what is found today in Google Semantic Search and KG.

amit@sc.edu

References

[1] Yip, H. Y., Sheth, A.: The EMPWR Platform: Data and Knowledge-Driven Processes for the Knowledge Graph Lifecycle. IEEE Internet Computing 28(1), 61–69 (2024)
[2] Henson, C., Wickramarachchi, R.: Introduction to knowledge-infused learning for autonomous driving (Mar 2022)
[3] Wickramarachchi, R., Henson, C., Sheth, A.: Knowledge-infused Learning for Entity Prediction in Driving Scenes. Frontiers in Big Data 4, 759110 (2021)
[4] Wickramarachchi, R., Henson, C., Sheth, A.: Knowledge-Based Entity Prediction for Improved Machine Perception in Autonomous Systems. IEEE Intelligent Systems 37(5), 42–49 (2022). https://doi.org/10.1109/MIS.2022.3181015
[5] Chowdhury, S.N., Wickramarachchi, R., Gad-Elrab, M.H., Stepanova, D., Henson, C.: Towards leveraging commonsense knowledge for autonomous driving. ISWC (2021)
[6] Venkataramanan, R., Tripathy, A., Foltin, M., Yip, H. Y., Justine, A., Sheth, A.: Knowledge Graph Empowered Machine Learning Pipelines for Improved Efficiency, Reusability, and Explainability. IEEE Internet Computing 27(1), 81–88 (2023)
[7] Venkataramanan, R., Tripathy, A., Hashmi, A.: Role of Ontologies in Enabling AI Transparency (Sept 2023)
[8] Sheth, A., Roy, K., Gaur, M.: Neurosymbolic Artificial Intelligence (Why, What, and How). IEEE Intelligent Systems 38(3), 56–62 (2023)

Tutorial: Knowledge-driven Processes for Big Data Management and Applications
