Opensource

Introduction

The AI Institute of South Carolina (AIISC) seeks to make a significant majority of the tools, data, and ontologies developed as part of its research available under an open source license for use by fellow researchers, and often for broader use, especially when the quality of the software and tools is ready for external use. We also endeavor to support the evolving standards and specifications in a variety of related fields through W3C and other channels. AIISC will continue to develop community resources such as datasets, open source tools, and public services, which are hosted for significant periods after the end of the respective projects and/or made available through public or open source distribution channels.

Tools

  • Location Name Extractor (LNEx): A fine-grained geoparsing tool that extracts location mentions from texts and geocodes them using OpenStreetMap. The tool was specifically designed for disaster-related use cases to support spatio-temporal analysis of data for disaster response.


    Developer: Hussein Al-Olimat

  • Geoann: An annotation tool for geocoding location names in texts. We use Brat (brat.nlplab.org), a web-based tool for NLP-assisted text annotation, to visualize the annotated texts. Geoann allows the annotator to retrieve all the required geo-information without leaving the annotation panel, i.e., it facilitates the retrieval and search of location names using the Google Maps API. The tool then allows users to annotate location names by drawing bounding boxes around their spatial extents. Then, it saves the annotations in the same file-based stand-off format of each tweet.

    Developers: Hussein Al-Olimat and Dipesh Kadariya

  • DisasterRecord: DisasterRecord meets the requirements of a variety of users. A humanitarian organization may analyze the situation at a community level for deploying and mobilizing necessary help. A first response coordinator can monitor a specific type of emergency needs. Affected individuals may need to know about the nearest available help. Persons wishing to provide support can identify current needs in the geographic proximity for the type of help they can provide.

    Developers: Shruit Kar, Dipesh Kadariya, Mike Partin, and Hussein Al-Olimat

Ontologies and Datasets

Active

  • Multimodal Ingredient Substitution Knowledge Graph (MISKG) : Ingredient substitution is essential in adapting recipes to meet individual dietary needs, preferences, and ingredient availability. We introduce a Multimodal Ingredient Substitution Knowledge Graph (MISKG) that captures a comprehensive and contextual understanding of 16,077 ingredients and 80,110 substitution pairs. The KG integrates semantic, nutritional, and flavor data, allowing both text- and image-based querying for ingredient substitutions. Drawing on sources such as ConceptNet, Wikidata, Edamam, and FlavorDB, this dataset supports personalized recipe adjustments based on dietary constraints, health labels, and sensory preferences. This work addresses gaps in existing datasets by including visual representations, nutrient information, and contextual ingredient relationships, providing a valuable resource for culinary research and digital gastronomy.

    Links to dataset: GitHub, Kaggle

    Link to the License and Terms of Use

    Contact: revathycv24@gmail.com, revathy@email.sc.edu
    Date: November 12, 2024

  • Interaction based Online Harassment dataset : This dataset is related to the paper “ALONE: A Dataset for Toxic Behavior among Adolescents on Twitter” (link to this resource: https://dl.acm.org/doi/10.1007/978-3-030-60975-7_31). The dataset consists of 688 harassing interactions with an aggregate of 16,901 tweets from 456 Twitter accounts belonging to adolescents. To access the dataset, please download, read, and sign the Terms of Use (TOU) form below and email it to thilini@sc.edu. Once the signed TOU has been reviewed, you will gain access to the dataset. Please note that if you are working as a group and need to access this data, each member of the group needs to sign a separate copy of the TOU and send it to the email address listed above.
    Link to the Terms of Use of the ALONE dataset.

    Email: thilini@sc.edu
    Date: January 17, 2023

  • Drug Abuse Ontology (DAO) : DAO provides a powerful framework and a useful resource that can be expanded and adapted to a wide range of substance use and mental health domains to help advance big data analytics of web-based data for substance use epidemiology research.

    The current version of DAO comprises 315 classes, 31 relationships, and 814 instances among the classes. The ontology is flexible and can easily accommodate new concepts. The ontology is recurrently updated to capture evolving concepts in different contexts and applied to analyze data related to social media and dark web marketplaces.

    All publications that used this ontology in any form should cite the work through the following: Lokala U, Lamy F, Daniulaityte R, Gaur M, Gyrard A, Thirunarayan K, Kursuncu U, Sheth A. Drug Abuse Ontology to Harness Web-Based Data for Substance Use Epidemiology Research: Ontology Development Study. JMIR Public Health and Surveillance. 10/05/2022:24938

    About DAO: DAO Wikipedia page

    Link to the Terms of Use of the DAO dataset.
  • PRIMATE 2022 Dataset : This dataset can only be used for research purposes. If you are interested in having access to this data, please sign the PRIMATE dataset agreement and upload the completed PDF at the end of the form https://forms.gle/4yinUi829kUpfZPZ7.
    The link to the dataset will then be shared with you immediately. If you face any issues, you can contact the authors at anmol.agarwal@students.iiit.ac.in, shrey.gupta@students.iiit.ac.in, mgaur@email.sc.edu, or kaushikr@email.sc.edu.

    Email: kaushikr@email.sc.edu
    Date: January 17, 2023


    Dataset

    Question-like labels from PHQ-9 used in the dataset:

    1. "Feeling-bad-about-yourself-or-that-you-are-a-failure-or-have-let-yourself-or-your-family-down"
    2. "Feeling-down-depressed-or-hopeless"
    3. "Feeling-tired-or-having-little-energy"
    4. "Little-interest-or-pleasure-in-doing-things"
    5. "Moving-or-speaking-so-slowly-that-other-people-could-have-noticed-Or-the-opposite-being-so-fidgety-or-restless-that-you-have-been-moving-around-a-lot-more-than-usual"
    6. "Poor-appetite-or-overeating"
    7. "Thoughts-that-you-would-be-better-off-dead-or-of-hurting-yourself-in-some-way"
    8. "Trouble-concentrating-on-things-such-as-reading-the-newspaper-or-watching-television"
    9. "Trouble-falling-or-staying-asleep-or-sleeping-too-much"
  • Reddit C-SSRS Suicide Dataset : Knowledge-aware Assessment of Severity of Suicide Risk for Early Intervention. Mental illness such as depression is a significant risk factor for suicide ideation, behaviors, and attempts. A report by the Substance Abuse and Mental Health Services Administration (SAMHSA) shows that 80% of the patients suffering from Borderline Personality Disorder (BPD) have suicidal behavior, 5-10% of whom commit suicide. While multiple initiatives have been developed and implemented for suicide prevention, a key challenge has been the social stigma associated with mental disorders, which deters patients from seeking help or sharing their experiences directly with others, including clinicians. This is particularly true for teenagers and younger adults, for whom suicide is the second highest cause of death in the US. Prior research involving surveys and questionnaires (e.g., PHQ-9) for suicide risk prediction failed to provide a quantitative assessment of risk that informed timely clinical decision-making for intervention. Our interdisciplinary study concerns the use of Reddit as an unobtrusive data source for gleaning information about suicidal tendencies and other related mental health conditions afflicting depressed users. We provide details of our learning framework that incorporates domain-specific knowledge to predict the severity of suicide risk for an individual. Our approach involves developing a suicide risk severity lexicon using medical knowledge bases and a suicide ontology to detect cues relevant to suicidal thoughts and actions. We also use language modeling, medical entity recognition and normalization, and negation detection to create a dataset of 2181 redditors who have discussed or implied suicidal ideation, behavior, or attempt. Given the importance of clinical knowledge, our gold standard dataset of 500 redditors (out of 2181) was developed by four practicing psychiatrists following the guidelines outlined in the Columbia Suicide Severity Rating Scale (C-SSRS), with a pairwise annotator agreement of 0.79 and a group-wise agreement of 0.73. Compared to the existing four-label classification scheme (no risk, low risk, moderate risk, and high risk), our proposed C-SSRS-based 5-label classification scheme distinguishes people who are supportive from those who show different severities of suicidal tendency. Our 5-label classification scheme outperforms the state-of-the-art schemes by improving the graded recall by 4.2% and reducing the perceived risk measure by 12.5%. A convolutional neural network (CNN) provided the best performance in our scheme due to the discriminative features and use of domain-specific knowledge resources, in comparison to SVM-L, which has been used in state-of-the-art tools over similar datasets. Link to the dataset : https://zenodo.org/record/2667859#.Y5N89-zMLdp

    Email: manas@umbc.edu
    Date: January 17, 2023

  • Characterization of Time-variant and Time-invariant Assessment of Suicidality on Reddit using C-SSRS : Suicide is the 10th leading cause of death in the U.S. (1999-2019). However, predicting when someone will attempt or complete suicide has been nearly impossible. In the modern world, many individuals suffering from mental illness seek emotional support and advice on well-known and easily accessible social media platforms such as Reddit. While prior artificial intelligence research has demonstrated the ability to extract valuable information from social media on suicidal thoughts and behaviors, these efforts have not considered both severity and temporality of risk. The insights made possible by access to such data have enormous clinical potential - most dramatically envisioned as a trigger to employ timely and targeted interventions (i.e., voluntary and involuntary psychiatric hospitalization) to save lives. In this work, we address this knowledge gap by developing natural datasets of users experiencing suicide-related ideations, suicide-related behaviors, or suicide attempts (https://zenodo.org/record/2667859#.YCwdTR1OlQI) manifested through their communication on r/SuicideWatch and associated mental health subreddits. Using a widely recognized questionnaire for assessing suicide risk severity, the Columbia Suicide Severity Rating Scale, the domain experts in the study annotated 448 users with the following labels: Supportive (a new addition to C-SSRS, specific to social media), Suicide Ideation, Suicide Behavior, Suicide Attempt. High standards in annotation were maintained, with substantial inter-rater agreement of 0.76. Link to the dataset : https://zenodo.org/record/4543776#.Y5N8-ezMLdp

    Email: manas@umbc.edu
    Date: January 17, 2023

Archived

  • SoCS Ontology for Crisis Coordination (SOCC): We extend the concepts of domain knowledge-driven models, the MOAC (Management Of A Crisis) ontology (Limbu 2012) and UNOCHA's HXL (Humanitarian Exchange Language) ontology (Keßler et al. 2013), with required but missing concepts for organizing data during crisis response coordination for seeker and supplier behavior, and with indicators of resource needs using a lexicon. For example, the 'shelter' class contains the words 'emergency center,' 'tent,' and 'shelter,' along with lexical alternatives. For the present demonstration, we focus on three resource categories: food, shelter, and medical needs. Thus, we endeavor to exploit a minimal, but always expandable, subset that provides the maximum coverage while controlling false alarms. For creating lexicons of indicator words for concepts, we relied on various documents collected via interactions with domain experts (Flach et al. 2013), our Community Emergency Response Team (CERT) training, Rural Domestic Preparedness Consortium training, and publicly available references (Homeland Security 2010; FEMA 2012; OCHA, Verity 2011). Using a first aid handbook (Swienton and Subbarao 2012), we created an extensive 'medical' subset of emergency indicators, where we identified words that pertained specifically to first aid or injuries and included those words along with variations in tense (e.g., breath, breathing, breathes) and common abbreviations (e.g., mouth to mouth, mouth 2 mouth, CPR). A local expert with FEMA experience augmented the model with additional indicators and provided anecdotal context. The current model, with food, medical, and shelter resource indicators, contains 43 concepts and 45 relationships. We created this domain model in the OWL language using the Protégé ontology editor (Protégé 2013). Each type of disaster is listed as an entity type, with indicators for that disaster listed as individuals under a corresponding indicator entity. Therefore, a relationship is declared stating that a particular disaster concept, say Flood, relates through the property 'has_a_positive_indicator' to the 'Flood_i' indicator entity, which includes all words under that heading. Each disaster has a declared negative relationship with the negative indicator list (e.g., 'erotic' under sexual words indicators) under the entity name Negative_Indicator_i. Finally, resources are declared as individuals under the appropriate entity in the same way, but relationships are not explicitly stated with any disaster in order to provide flexibility. [Read more: Purohit et al., JCSCW 2014]
  • Singleton Property Datasets: The singleton property approach can be used to represent statements about statements in RDF without the use of reification. The main idea of the approach is to create a property instance and enforce the uniqueness of the property instance in only one triple. This approach is compatible with RDF/RDFS and SPARQL.
  • Citypulse Dataset: This webpage offers a number of semantically annotated datasets collected from partners of the CityPulse EU FP7 project and relevant resources for smart city data (over 120 GB of data in 6 large datasets as of November 2014).
  • Sensor and Sensor Network (SSN) Ontology: The Sensor and Sensor Network ontology, known as the SSN ontology, answers the need for a domain-independent and end-to-end model for sensing applications by merging sensor-focused (e.g. SensorML), observation-focused (e.g. Observation & Measurement) and system-focused views. It covers the sub-domains which are sensor-specific such as the sensing principles and capabilities and can be used to define how a sensor will perform in a particular context to help characterize the quality of sensed data or to better task sensors in unpredictable environments. Although the ontology leaves the observed domain unspecified, domain semantics, units of measurement, time and time series, and location and mobility ontologies can be easily attached when instantiating the ontology for any particular sensors in a domain. The alignment between the SSN ontology and the DOLCE Ultra Lite upper ontology has helped to normalise the structure of the ontology to assist its use in conjunction with ontologies or linked data resources developed elsewhere. This ontology is publicly accessible via W3C.
  • Provenir : A reference ontology for modeling domain-specific provenance. Additional information and a download with a Creative Commons license are available at: http://wiki.knoesis.org/index.php/Provenir_Ontology.
  • Proteomics data and process provenance (ProPreO): ProPreO is a large glycoproteomics provenance ontology. The ProPreO schema includes 480 classes and attendant relations, whereas the populated ontology includes 3.1 million instances. More information is available at the Kno.e.sis ProPreO page. This ontology is publicly accessible via NCBO bioportal.
  • Parasite Experiment Ontology (PEO): The ontology comprehensively models the processes, instruments, parameters, and sample details that will be used to annotate experimental results with provenance metadata (derivation history of results). More details are available on the PEO wiki page, and the ontology is publicly accessible via NCBO bioportal.
  • Parasite Life Cycle Ontology (PLO): PLO models the life cycle stage details of T. cruzi and two related kinetoplastids, Trypanosoma brucei and Leishmania major. More information on PLO is available from the PLO wiki page, and the ontology is available through NCBO bioportal.
  • Linked Sensor Data and Linked Observation Data: Linked Sensor Data is an RDF dataset containing expressive descriptions of ~20,000 weather stations in the United States. Linked Observation Data is a 1.7 billion triple RDF dataset containing expressive descriptions of hurricane and blizzard observations in the United States. All data is also included in the Linked Open Data Cloud.
  • SWETO and SWETODBLP: Semantic Web Technology Evaluation Ontology and its follow-on SWETODBLP were early populated ontologies created by extracting real-world data using tools created by Taalee/Voquette/Semagix (a company founded by Prof. Amit Sheth) that were made available at no cost for research use. The latest SWETODBLP data with a Creative Commons License is available at http://knoesis.wright.edu/library/ontologies/swetodblp/.
  • City Event Extraction Dataset: Using citizen sensor observations in the form of microblogs to extract city events provides city authorities direct access to the pulse of the populace. This dataset contains textual data (tweets) collected from the San Francisco Bay Area over four months and ground truth data for traffic-related events collected from 511.org. This dataset can be used for evaluating city event extraction techniques/algorithms, as it has both the textual events and the ground truth. This dataset is available with a Creative Commons License on Open Science Framework.
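As an illustration of the idea described in the Singleton Property Datasets entry above, the following minimal Python sketch stores triples as plain tuples rather than using an RDF library. Each asserted statement gets its own property instance, which appears as the predicate of exactly one triple, is linked back to the generic property, and can then carry metadata such as a source. The class and method names, the `ex:` identifiers, and the `#n` naming scheme are hypothetical, and the linking predicate `rdf:singletonPropertyOf` is an assumption about the vocabulary; the published datasets may differ.

```python
from itertools import count

# Assumed name of the predicate linking a singleton property to its generic property.
SINGLETON_PROPERTY_OF = "rdf:singletonPropertyOf"


class SingletonGraph:
    """Toy triple store illustrating statements about statements without reification."""

    def __init__(self):
        self.triples = set()      # set of (subject, predicate, object) tuples
        self._ids = count(1)      # counter used to mint unique property instances

    def assert_with_metadata(self, s, p, o, metadata):
        """Assert (s, p, o) through a fresh singleton property and attach metadata to it."""
        sp = f"{p}#{next(self._ids)}"                    # hypothetical naming scheme
        self.triples.add((s, sp, o))                     # the only triple using sp as predicate
        self.triples.add((sp, SINGLETON_PROPERTY_OF, p)) # link back to the generic property
        for meta_predicate, meta_value in metadata.items():
            self.triples.add((sp, meta_predicate, meta_value))  # statement about the statement
        return sp

    def statements(self, generic_property):
        """Return sorted (subject, object, singleton_property) for a generic property."""
        singletons = {
            s for (s, p, o) in self.triples
            if p == SINGLETON_PROPERTY_OF and o == generic_property
        }
        return sorted((s, o, p) for (s, p, o) in self.triples if p in singletons)


if __name__ == "__main__":
    g = SingletonGraph()
    g.assert_with_metadata(
        "ex:BobDylan", "ex:isMarriedTo", "ex:SaraLownds",
        {"ex:hasSource": "ex:wikipedia"},
    )
    for triple in sorted(g.triples):
        print(triple)
```

Because the metadata is attached to the property instance rather than to a reified statement node, every triple in the sketch remains an ordinary subject-predicate-object triple, which is what keeps the approach compatible with RDF/RDFS and SPARQL.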

Past Projects

  • Projects: METEOR-S [PI: Prof. Amit Sheth]
  • MobiCloud: MobiCloud is a Domain Specific Language (DSL) based platform agnostic application development paradigm for cloud-mobile hybrid applications. A cloud-mobile hybrid is simply an application that partially runs on the mobile device and in the cloud. MobiCloud makes it extremely easy to develop these applications and deploy them to clouds and mobile devices.
  • Twarql: Twarql investigates the representation of tweets as RDF in order to enable flexibility in handling the information overload of those interested in collectively analyzing social media for sensemaking. The Twarql source can be accessed at https://sourceforge.net/projects/twarql/ and is available under the BSD license.
  • Cuebee: A flexible, extensible application for querying the semantic web. It provides a friendly interface to guide users through the process of formulating complex queries. The Cuebee source can be accessed at http://cuebee.sourceforge.net/ and is available under the Creative Commons Attribution-No Derivative Works 3.0 Unported License.
  • Kino (also known as KinoE) is a Web document annotation and indexing system that helps scientists annotate and index Web documents. Kino uses a browser plugin to add annotations and an Apache Solr-based backend to index and store the Web pages. The Kino source can be accessed at https://sourceforge.net/p/sarestannotator/code/HEAD/tree/branches/NCBO-bound-annotator/ and is available under the Apache 2.0 license.
  • The Doozer model creation framework extracts entities and relationships from text with the goal of building comprehensive formal models of emerging or continuously changing domains. Upon completion, the code will be made available here. Example domain models created with the prototype can be found on the project page: http://knoesis.org/research/semweb/projects/knowledge-extraction/
  • BLOOMS : An acronym for Bootstrapping-based Linked Open Data Ontology Matching System, BLOOMS is an ontology alignment system based on the idea of bootstrapping information already present on the LOD cloud. It was developed particularly for Linked Open Data schema alignment. Further details are available on the BLOOMS wiki page.
  • Scooner : Scooner is a prototype search application that integrates the Web of pages with Linked Open Data. A demo of Scooner is available, and further details are on the wiki page.
  • SA-REST Annotator: SA-REST is a specification for adding annotations to Web pages. SA-REST Annotator is a Firefox-based browser plugin that allows users to add SA-REST annotations and publish them. The SA-REST Annotator source can be accessed at http://sarestannotator.sourceforge.net/ and is available under the Apache 2.0 license.
  • SAWSDL4J: A clean object model for handling SAWSDL. The source and the binaries for this project can be downloaded from http://sawsdl4j.sourceforge.net/. The software is available for use under the Apache 2.0 license.
  • Test Ontology Generation Tool (TOntoGen): TOntoGen generates large, high-quality data sets for testing semantic web applications. It has been implemented as a Protege plugin. TOntoGen can be downloaded from the LSDIS TOntoGen project page.
  • Radiant: Radiant is an Eclipse based graphical UI for annotating existing WSDL documents into WSDL-S or SAWSDL via an OWL Ontology. Radiant can be downloaded via the LSDIS Radiant project page.
  • BRAHMS: A fast main-memory RDF/S storage, capable of storing, accessing, and querying large ontologies.
  • Semantic Visualization Tools: Semantic Visualization [SemViz] was a subproject within SemDis project and developed some of the earliest Semantic Web/RDF visualization tools: Semantic Analytics Visualization [SAV] - a 3D visualization tool for semantic analytics, SET - Semantic EventTracker, a highly interactive visualization tool for tracking and associating activities (events), and Paged Graph Visualization [PGV].
  • SemDis API: A simple yet flexible set of interfaces intended to be a basis for implementations of RDF data access suitable to the types of algorithms being developed in the SemDis project.
  • Semantic Browser: A tool that demonstrates the concept of Relationship Web by creating a relationships-centric metaweb on documents. It allows users to traverse semantically connected documents through domain-specific relationships and uses research in entity and relationship extraction.

Standards

AIISC researchers (while at previous institutions) have had a significant impact on standards and have shown strong leadership in standards activities:

  • Prof. Amit Sheth has served as a W3C advisory committee member since 2002.
  • Prof. Sheth and his team defined WSDL-S for annotating semantic web services, which was submitted to W3C in collaboration with IBM. SAWSDL, adopted as a recommendation (standard) in 2007, was directly based on WSDL-S, and Kno.e.sis members were active members of the W3C SAWSDL working group that defined SAWSDL. Prof. Sheth also co-chaired the W3C Semantic Web Service Testbed Incubator Group (XG). [Kno.e.sis contributors: Karthik Gomadam, Meena Nagarajan, Ajith Ranabahu, Amit Sheth, Kunal Verma, with John Miller at UGA]
  • Prof. Sheth proposed GLYDE, which was subsequently developed by his team with Prof. William York of UGA's Complex Carbohydrate Research Center. GLYDE-II is an XML standard for data exchange that has been accepted as the standard protocol by the leading carbohydrate databases in the United States, Germany, and Japan. [Kno.e.sis/LSDIS contributors: Cory Henson, Satya Sahoo, Amit Sheth, Christopher Thomas]
  • Prof. Sheth was an active early member of the W3C Semantic Web for Health Care and Life Sciences interest group (HCLSIG) and provided its earliest use case, based on the Active Semantic Electronic Medical Record, a semantic web application operationally deployed in a clinical setting since January 2006.
  • Dr. Satya Sahoo (Advisor: Prof. Sheth), while at Kno.e.sis, was a key participant in the W3C Provenance XG. He defined semantic provenance and developed the Provenir ontology, which influenced the Provenance XG's related work, which was then adapted by the Provenance Working Group. Satya (now at Case Western Reserve University) was a contributor to the W3C Provenance XG Final Report.
  • Dr. Matthew Perry (Advisor: Prof. Sheth), who worked on SPARQL-ST and spatial, temporal, and thematic analytics over Semantic Web data at Kno.e.sis, continued his work on spatial extensions of SPARQL after joining Oracle. He was one of the two editors of the Open Geospatial Consortium's OGC GeoSPARQL - A Geographic Query Language for RDF Data.
  • Prof. Sheth co-founded and co-chaired the W3C Semantic Sensor Networking (SSN) XG, which developed the now widely used SSN Ontology. Cory Henson is a co-editor of the SSN XG Final Report and the primary author of its semantic sensor data annotation aspects. [Participants: Cory Henson, Amit Sheth]
  • Prof. Sheth and his team have developed SA-REST, which is also a W3C member submission. Many use cases and tools have been developed to support SA-REST-based semantic web service annotation, semantic search/discovery of Web APIs and RESTful services, etc. [Participants: Karthik Gomadam, Ajith Ranabahu, Amit Sheth]
  • Kno.e.sis researchers have participated in or are participating in a number of W3C working/interest/community groups, including Linked Data Platform and HCLSIG.