Date: TBA

Summary of the Tutorial

In today's data-driven world, organizations derive insights from massive amounts of data through large-scale statistical machine learning models. However, statistical techniques can be easy to fool with adversarial instances (for example, a neural network can predict a non-extremist to be an extremist from the mere presence of the word Jihad), which raises questions about data quality. In high-stakes decision-making problems, such as cyber social threats, classifying a non-extremist as an extremist, or vice versa, is highly consequential. Data quality is good if the data possesses adequate domain coverage and the labels contain adequate semantics. For example, are the semantics of an extremist vs. a non-extremist vis-a-vis the word Jihad captured in the label (adequate semantics in labels)? Also, are there enough non-extremists using the word Jihad in the training data from the perspective of religion, hate, or ideology (adequate domain coverage)? Thus semantic annotation of the data, beyond mere labels attached to data instances, can significantly improve the robustness of model outcomes and ensure that the model has learned from trustworthy, knowledge-guided data standards. It is important to note that knowledge-guided standards help de-bias the data if specified correctly (for example, contextualized de-biasing of extremist behavior data away from a bias towards the word Jihad). Therefore, in addition to trust in the robustness of outcomes, knowledge-guided data creation also enables fair and ethical practices during real-world deployment of machine learning in high-stakes decision making. We call such data Explainable Data. In this course-and-case-study tutorial, we detail how to construct Explainable Data using various expert resources and knowledge graphs. All the materials (resources and implementations) presented during the tutorial will be made available on KIWO-ICWSM a week before the tutorial. We plan a 90-minute tutorial (Intermediate Level) with 2 breaks (5 mins each).

Tutorial Schedule and Activities

The current landscape of AI (30 mins)

In this section, we cover AI systems that have been developed in the fields of consumer health question answering, mental health care informatics, crisis response, addiction response, and cyber social harassment [6,7,8,12,9]. We discuss the techniques and present a comprehensive survey of datasets used to train AI systems for social and public health [11,15]. We evaluate the datasets in terms of:

  1. Domain Coverage - Does the dataset represent the population across spatio-temporal and thematic variations? For example, we can bootstrap a sparse, region-specific dataset using a domain-specific ontology to be applicable in a different region (using opioid crisis analysis in the Midwest to understand the opioid epidemic in Ohio).
  2. Adequacy of labels in capturing the semantics - Do the dataset labels contain enough information to model and understand the context? Consider studying suicide risk severity in crisis scenarios. The labels suicidal indication, ideation, behavior, and attempt, which are supported by a clinically accepted questionnaire (Columbia Suicide Severity Rating Scale [2]), capture the context more meaningfully than no risk, low risk, moderate risk, or severe risk.
  3. Hidden Biases - Does the dataset have inherent biases that machine learning won’t find on its own? Consider a study of religion, ideology, and hate; aggregating the population over these categories introduces inherent aggregation biases (a particular religion or ideology in association with extremism) that knowledge can effectively identify.
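The coverage and bias questions above can be checked with a very simple audit before any model is trained. The following is a minimal sketch, not the method used in the cited work: the function name `keyword_coverage`, the toy examples, and the label names are all hypothetical and chosen only to mirror the Jihad example from the summary. It counts how often a keyword appears under each label, so that a keyword occurring almost exclusively under one label can be flagged for expert review.

```python
from collections import Counter

def keyword_coverage(examples, keyword):
    """Return {label: (examples containing keyword, total examples)}."""
    keyword = keyword.lower()
    hits, totals = Counter(), Counter()
    for text, label in examples:
        totals[label] += 1
        # Lowercase, split on whitespace, and strip surrounding punctuation.
        tokens = [tok.strip(".,!?;:\"'()") for tok in text.lower().split()]
        if keyword in tokens:
            hits[label] += 1
    return {label: (hits[label], totals[label]) for label in totals}

# Hypothetical toy data; real datasets would be far larger.
toy_examples = [
    ("A lecture on the spiritual meaning of jihad.", "non-extremist"),
    ("Weekend plans and football scores.", "non-extremist"),
    ("Placeholder post that mentions jihad.", "extremist"),
    ("Another placeholder post with jihad in it.", "extremist"),
]

coverage = keyword_coverage(toy_examples, "jihad")
for label, (with_keyword, total) in coverage.items():
    print(f"{label}: {with_keyword}/{total} examples contain the keyword")
```

If the keyword appears in nearly every example of one label and almost none of the other, a model can learn the keyword as a shortcut; adding knowledge-guided examples of the under-represented case is one way to address this.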

We motivate the need for Explainable Data to train machine learning models whose outcomes can be trusted to be robust and fair when deployed in high-stakes decision-making applications, especially those concerning cyber social threats and public health.

Explainable Data: Examples (30 mins)

We present our work on creating Explainable Data, including use cases from mental health care informatics, crisis response, addiction response, and cyber social harassment [2,8,13]. We show that domain experts and end-users trust outcomes from models trained on these datasets to be accurate in their predictions and reliably de-biased [4]. As an example, we created a dataset from the dark web, curated using a drug abuse ontology. The drug entities are extracted from product descriptions using ontology mappings. Those drug entities helped the domain experts extract social media opinions about different drugs being sold in marketplaces. When a deep learning model is trained on this kind of data, we can explain to domain experts which entity or model layer contributed to a given prediction, because we have a good handle on how the dataset was created [14,1,9].
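The ontology-mapping step described above can be illustrated with a dictionary lookup. This is only a sketch under stated assumptions: `TOY_LEXICON` is a hypothetical four-entry mapping from surface forms to canonical concepts, not an excerpt from the actual ontology, and `extract_entities` is an illustrative helper rather than the authors' pipeline.

```python
import re

# Hypothetical surface-form -> canonical-concept mapping (illustration only).
TOY_LEXICON = {
    "fentanyl": "Fentanyl",
    "fent": "Fentanyl",
    "oxycodone": "Oxycodone",
    "oxy": "Oxycodone",
}

def extract_entities(description, lexicon):
    """Return canonical concepts mentioned in a text, in order of first mention."""
    tokens = re.findall(r"[a-z0-9]+", description.lower())
    found = []
    for tok in tokens:
        concept = lexicon.get(tok)
        if concept is not None and concept not in found:
            found.append(concept)
    return found

entities = extract_entities("Listing mentions fent and oxy.", TOY_LEXICON)
print(entities)
```

Because every extracted entity traces back to a named concept in the lexicon, the provenance of each annotation in the resulting dataset can be shown to a domain expert.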

Explainable Data: Guidelines and Evaluation (30 mins)

Lastly, we present general guidelines for Explainable Data creation and evaluation metrics to assess the extent to which models trained on the datasets are explainable and trustworthy. To conclude, we briefly discuss future directions for data creation with higher-level aims such as generalizability to abstractly related fields (e.g., puzzle-solving in relation to molecular design).

Target audience, prerequisites and outcomes

ICWSM is a forum for researchers from multiple disciplines to share knowledge, discuss ideas, exchange information, and learn about cutting-edge research in diverse fields with the common theme of online social media. ICWSM is a singularly fitting venue for research that blends social science and computational approaches to answer important and challenging questions about human social behavior through social media while advancing computational tools for vast and unstructured data. The overall theme of this tutorial includes research on new perspectives in social theories and computational algorithms for analyzing social media. The tutorial will describe methods for domain-specific dataset curation, aligning with the interdisciplinary character of ICWSM. This tutorial will bring researchers and practitioners together at the confluence of contextualized knowledge representation, reasoning, semantic linking, natural language processing, and deep learning.

Essentially, the tutorial will give the audience an opportunity to learn more about the effectiveness of contextualized datasets in explainable AI. It will convince attendees of the value of domain knowledge-enhanced AI techniques for overcoming obstacles in explaining results in social good domains, and for better managing the lack of high-quality training data and the poor interpretability they currently face. As with our recent and past tutorials, we will provide a comprehensive web resource for the benefit of the community.


Our tutors have co-presented multiple tutorials as follows:

  1. Manas Gaur, Keyur Faldu, Ankit Desai, Amit Sheth, Explainable AI using Knowledge Graphs, ACM CoDS-COMAD 2021: This tutorial discussed the notion of explainability and interpretability through the use of knowledge graphs in Healthcare and Education Technology.
  2. Manas Gaur, Ugur Kursuncu, Amit Sheth, Ruwan Wickramarachchi, Shweta Yadav, Knowledge-infused Deep Learning, ACM HT 2020 [3]: This tutorial discussed the novel paradigm of knowledge-infused deep learning to synthesize neural computing with symbolic computing and described different forms of knowledge and infusion methods in deep learning with evaluation and pointers to some nature testbeds.
  3. Hussein Al-Olimat, Amir Yazdavar, Krishnaprasad Thirunarayan, Amit Sheth, Location Extraction and Georeferencing in Social Media: Challenges, Techniques, and Applications, ISCRAM 2018: The tutorial presented the general problem of georeferencing and location extraction, summarized the state-of-the-art research, discussed challenges, and provided an overview of AIISC's recent research accomplishments in the context of disaster management.
  4. Hemant Purohit, Amit Sheth, Patrick Meier, Carlos Castillo, Crisis Mapping, Citizen Sensing, and Social Media Analytics: Leveraging Citizen Roles for Crisis Response, AAAI ICWSM 2013: The tutorial covered three themes of citizen sensing and crisis mapping, technical challenges and recent research for leveraging citizen sensing to improve crisis response coordination, and experiences in building robust and scalable crisis management platforms. This tutorial had ~70K views.

This is the first time the highlighted tutors are delivering a tutorial together, although they have worked together on various talks and presentations at PyData and Health-Science Conferences. While past tutorials and presentations have covered incorporating knowledge into the training methodology, in this tutorial we explain how knowledge is used to enhance the dataset used for training [3,5]. Such knowledge-guided data enrichment is easy for domain experts and end-users to understand, enabling their trust in the outcomes of models. Additional benefits include generalizable and reliable behavior across training methods and tasks on similar data.

Presenters' Biographies

Amit Sheth


Artificial Intelligence Institute, University of South Carolina

Prof. Amit Sheth is an Educator, Researcher, and Entrepreneur. He is the founding director of the university-wide Artificial Intelligence Institute at the University of South Carolina (#AIISC). Previously, he was the LexisNexis Ohio Eminent Scholar and the executive director of the Ohio Center of Excellence in Knowledge-enabled Computing. He is a Fellow of IEEE, AAAI, and AAAS. He has organized 75+ international events (as general/program chair or organization committee chair), delivered 65+ keynotes, given many well-attended tutorials, and is a well-cited computer scientist. He has founded three companies by licensing his university research outcomes, including the first Semantic Web company in 1999, which pioneered technology similar to what is found today in Google Semantic Search and Knowledge Graph. Several commercial products and deployed systems have resulted from his research.

Kaushik Roy

Artificial Intelligence Institute, University of South Carolina

He is a Ph.D. student at AIISC. He completed his master's in computer science at Indiana University Bloomington and has worked at UT Dallas’s Starling Lab. His research interests include statistical relational artificial intelligence, sequential decision making, knowledge graphs, and reinforcement learning. His work has been published at reputed venues, including IEEE, KR, and AAAI.

Manas Gaur

Contact| @manasgaur90

Artificial Intelligence Institute, University of South Carolina

Manas Gaur is currently a Ph.D. student at the Artificial Intelligence Institute at the University of South Carolina. He has been a Data Science and AI for Social Good Fellow with the University of Chicago and Dataminr Inc. His interdisciplinary research, funded by NIH and NSF, operationalizes the use of Knowledge Graphs, Natural Language Understanding, and Machine Learning to solve social good problems in the domains of Mental Health, Cyber Social Harms, and Crisis Response. His work has appeared in premier AI and Data Science conferences (CIKM, WWW, AAAI, CSCW), journals in science (PLOS One, Springer-Nature, IEEE Internet Computing), and healthcare-specific meetings (NIMH MHSR, AMIA).

Usha Lokala

Artificial Intelligence Institute, University of South Carolina

She is a Ph.D. student at AIISC. Her research interests include ontology engineering, knowledge graphs, and natural language processing. Her work has been published in reputed conferences and journals (IEEE, Drug and Alcohol Dependence, WWW, CPDD). Her work on public health and addiction [10] won second prize in the Opioid Challenge at SBP BRiMS 2018, a computational social science conference.


  1. Daniulaityte, R.; Lamy, F.; Gaur, M.; Thirunarayan, K.; Kursuncu, U.; Sheth, A. P.; et al. 2020. DAO: An ontology for substance use epidemiology on social media and dark web. JMIR Public Health and Surveillance.

  2. Gaur, M.; Alambo, A.; Sain, J. P.; Kursuncu, U.; Thirunarayan, K.; Kavuluru, R.; Sheth, A.; Welton, R.; and Pathak, J. 2019. Knowledge-aware assessment of severity of suicide risk for early intervention. In ACM Web Conference.

  3. Gaur, M.; Kursuncu, U.; Sheth, A.; Wickramarachchi, R.; and Yadav, S. 2020. Knowledge-infused deep learning. In ACM Hypertext and Social Media.

  4. Gaur, M.; Faldu, K.; and Sheth, A. 2020. Semantics of the black-box: Can knowledge graphs help make deep learning systems more interpretable and explainable? arXiv preprint arXiv:2010.08660.

  5. Gaur, M.; Kursuncu, U.; and Wickramarachchi, R. 2019. Shades of knowledge-infused learning for enhancing deep learning. IEEE Internet Computing.

  6. He, Y.; Zhu, Z.; Zhang, Y.; Chen, Q.; and Caverlee, J. 2020. Infusing disease knowledge into BERT for health question answering, medical inference and disease name recognition. arXiv preprint arXiv:2010.03746.

  7. Kursuncu, U.; Gaur, M.; Lokala, U.; Illendula, A.; Thirunarayan, K.; Daniulaityte, R.; Sheth, A.; and Arpinar, I. B. 2018. What’s ur type? Contextualized classification of user types in marijuana-related communications using compositional multiview embedding. In Web Intelligence.

  8. Kursuncu, U.; Gaur, M.; Castillo, C.; Alambo, A.; Thirunarayan, K.; Shalin, V.; Achilov, D.; Arpinar, I. B.; and Sheth, A. 2019. Modeling Islamist extremist communications on social media using contextual dimensions: Religion, ideology, and hate. ACM CSCW.

  9. Lamy, F. R.; Daniulaityte, R.; Barratt, M. J.; Lokala, U.; Sheth, A.; and Carlson, R. G. 2020. Listed for sale: Analyzing data on fentanyl, fentanyl analogs and other novel synthetic opioids on one cryptomarket. Drug and Alcohol Dependence.

  10. Lokala, U.; Lamy, F. R.; Daniulaityte, R.; Sheth, A.; Nahhas, R. W.; Roden, J. I.; Yadav, S.; and Carlson, R. G. 2019. Global trends, local harms: Availability of fentanyl-type drugs on the dark web and accidental overdoses in Ohio. Computational and Mathematical Organization Theory.

  11. Olteanu, A.; Castillo, C.; Diaz, F.; and Kiciman, E. 2019. Social data: Biases, methodological pitfalls, and ethical boundaries. Frontiers in Big Data.

  12. Wang, T.; Zhao, J.; Yatskar, M.; Chang, K. W.; and Ordonez, V. 2019. Balanced datasets are not enough: Estimating and mitigating gender bias in deep image representations. In IEEE ICCV.

  13. Wijesiriwardene, T.; Inan, H.; Kursuncu, U.; Gaur, M.; Shalin, V. L.; Thirunarayan, K.; Sheth, A.; and Arpinar, I. B. 2020. ALONE: A dataset for toxic behavior among adolescents on Twitter. In Social Informatics.

  14. Yadav, S.; Lokala, U.; Daniulaityte, R.; Thirunarayan, K.; Lamy, F.; and Sheth, A. 2020. “When they say weed causes depression, but it’s your fav antidepressant”: Knowledge-aware attention framework for relationship extraction.

  15. Yang, K.; Qinami, K.; Fei-Fei, L.; Deng, J.; and Russakovsky, O. 2020. Towards fairer datasets: Filtering and balancing the distribution of the people subtree in the ImageNet hierarchy. In ACM FAT.