The Third Colloquium on Analytics, Data Science, and Computing
The 3rd Colloquium on Analytics, Data Science, and Computing (CADSCOM 2021) was held virtually from 8:30 am to 5:30 pm CDT on March 20. Download the printable copy (PDF) of the CADSCOM 2021 event program.
In addition to research paper presentations, CADSCOM 2021 featured a keynote address, panel discussions, and an invited talk. All research papers were peer reviewed (double anonymous) for quality.
Top CADSCOM 2021 Papers: The following three papers were selected after two separate rounds of peer reviews. These top papers were recommended for fast-track review for the Journal of the Midwest Association for Information Systems (JMWAIS):
Keynote Address Video: Dr. Radhika Kulkarni
Accepted Research Papers
Title: Immutable Infrastructure with Actionable Monitoring on Containers (Kubernetes)
Author(s) and Affiliation: Mizan Hemani, Minnesota State University, Mankato
Abstract: With the dawn of cloud computing and the growing popularity of containers that run applications and microservices, it has become easier to build new architectures that are deployable as smaller cohesive segments that are highly scalable. Container-level deployment makes it easier to manage deployments between different environments. However, it carries forward the existing behavior of interacting directly with the server while bypassing the pre-configured deployment pipeline, potentially creating configuration drift and exposing the system to security vulnerabilities. In this paper, we explore the lack of immutability in a container infrastructure by monitoring audit-level logs of interactions with Kubernetes to perform actions based on established policies. By leveraging such policies, this paper proposes a pattern that can ensure an intact infrastructure and reinforce good security and system maintenance principles.
Title: Delay Tolerant Network Security
Author(s) and Affiliation: Rishabh Yata, Minnesota State University, Mankato
Abstract: A delay-tolerant network (DTN) is a store-and-forward network where end-to-end communication is not assumed and where data transmission is performed using opportunistic connections between nodes. DTN is a sparse wireless network that has recently been used by the existing network to link devices or the underdeveloped world in a challenging environment. In any protected environment, such as the military, a network security protocol is often needed. In DTN, the complete path from resource to target does not exist for the most part, which contributes to the difficulty of routing packets in such an area. For the large-scale implementation of delay-tolerant networks, protection and privacy are essential. People are hesitant to consider such a new network model without protection and privacy assurances. Therefore, in this paper, I plan to discuss various security and cryptography concepts and protocols that are currently in use, and to propose some promising enhancement concepts for DTN security.
Title: Using Prototyping to Teach Design Thinking
Author(s) and Affiliation: Mary Lebens, Metropolitan State University
Abstract: Companies using design thinking increase revenues and shareholder returns at almost double the rate of their industry peers, yet more than 90% of companies do not employ design thinking, in part due to a lack of design skills in the workforce. Adding design thinking to the curriculum is imperative to address this skills gap. Most research emphasizes developers and users physically working together, so it is significant to learn whether online students who are never physically present together in the classroom can successfully learn design thinking skills. This study examines whether students in an “asynchronous online” undergraduate systems analysis course can successfully apply user-centered design standards to develop a system prototype. Additionally, the study examines whether students are able to provide substantive feedback to their peers on their prototypes while participating in an iterative review process. The study method employed a model for prototype design, review, and assessment. The study demonstrates that over two course sections, the majority of students in an asynchronous online course successfully developed web prototypes that employed user-centered design and effectively provided feedback to peers on their prototypes during an iterative review process. The implication is that faculty can feel confident in employing design thinking and prototyping in asynchronous online courses to teach these valuable skills.
Title: Evaluation of P2P Loan Default Detection Models
Author(s) and Affiliation: Queen E. Booker, Metropolitan State University and Mousumi Munmun, Metropolitan State University
Abstract: The Peer-to-Peer (P2P) lending model is exploding in the US economy. A robust charge off/default detection method is needed to improve the quality of the P2P lending market and establish a more sustainable industry. The study specifically compares the Zhang (2020) Logistic Regression (LR) model to a Deep Learning Neural Network (DLNN) and Naïve Bayes (NB). However, based on the Lending Club dataset and Zhang’s (2020) variables, no model was particularly effective at detecting potentially bad loans.
Title: What do the Twitter sentiments say about the COVID-19 Vaccine?
Author(s) and Affiliation: Ilma Sheriff, Computer Information Science, Minnesota State University, Mankato and Naseef Mansoor, Minnesota State University, Mankato
Abstract: The coronavirus disease (COVID-19) pandemic led to substantial public discussion. Understanding these discussions can help institutions and individuals navigate through this pandemic. In this paper, we analyze and investigate Twitter sentiments toward the COVID-19 vaccine. Starting from a publicly available Twitter dataset on the COVID-19 vaccine from Kaggle, we create a unified dataset containing data about public sentiments, sentiment scores, and COVID-19 cases for various U.S. states. To generate sentiment scores from the tweets, we applied the Valence Aware Dictionary and sEntiment Reasoner (VADER) sentiment analyzer. These scores were then classified into positive, negative, and neutral sentiment classes using a simple threshold-based classifier. From our analysis, we observe that in our dataset around 41.93% of the tweets are positive, 17.64% are negative, and 40.42% are neutral. We also analyzed the data based on the geographic locations of the tweets to answer the following questions: 1) Is there any relationship between the number of tweets and the number of COVID-19 cases? 2) Is there any shift in public sentiment after the approval of the vaccine? Our analysis shows a high correlation between the number of tweets and the number of COVID-19 cases, as well as a decrease in negative sentiment after the approval of the vaccine.
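Illustrative sketch (not from the paper): a threshold-based classifier over sentiment scores might look like the Python below. The abstract does not state the thresholds used, so the ±0.05 cutoffs on the compound score are an assumption borrowed from the convention suggested in the VADER documentation. The function names and the example scores are hypothetical.

```python
from collections import Counter

# Assumed cutoffs: the VADER documentation suggests +/-0.05 on the compound
# score. The thresholds actually used in the paper are not given here.
POSITIVE_CUTOFF = 0.05
NEGATIVE_CUTOFF = -0.05


def classify_sentiment(compound_score: float) -> str:
    """Map a compound score in [-1, 1] to a sentiment class."""
    if compound_score >= POSITIVE_CUTOFF:
        return "positive"
    if compound_score <= NEGATIVE_CUTOFF:
        return "negative"
    return "neutral"


def class_shares(scores):
    """Return the percentage of scores that fall in each sentiment class."""
    counts = Counter(classify_sentiment(s) for s in scores)
    total = len(scores)
    return {label: 100.0 * counts[label] / total
            for label in ("positive", "negative", "neutral")}


# Hypothetical compound scores standing in for real analyzer output.
example_scores = [0.62, -0.41, 0.0, 0.03, 0.77, -0.05]
shares = class_shares(example_scores)
```

With real data, the scores would come from a sentiment analyzer run on each tweet, and `class_shares` would yield class percentages of the kind reported in the abstract.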
Title: Automated stock recommendations using Financial Indicators and Machine Learning (Full Paper PDF)
Author(s) and Affiliation: Utkarsh Sharma, ASET, Amity University and Simran Gogia, ASET, Amity University
Abstract: The stock market is regarded as one of the high-yielding long-term investments, yet a majority of people do not capitalize on it. Dubious advice and attempts to ‘beat the market’ usually give rise to skepticism and distrust among first-time investors. This paper proposes a subjective, low-risk stock market advising platform that leverages Machine Learning clustering (K-Means) on basic Financial Indicators that are used to track the performance of stocks in the exchange to serve as an aid in investment decisions, particularly for first-time investors. The results suggest that clustering-powered subjective recommendations can prove to be a low-risk advising tool.
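Illustrative sketch (not from the paper): K-Means groups items by alternating an assignment step and a centroid update step. The minimal Python version below clusters stocks on two indicators. The choice of indicators (price-to-earnings ratio and dividend yield), the sample values, and the seeding of centroids from the first k points are all assumptions made for illustration. The abstract does not say which indicators or initialization the authors used.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Hypothetical (price-to-earnings, dividend yield %) pairs for six stocks.
stocks = [
    (12.0, 4.1), (45.0, 0.3), (14.5, 3.8),
    (50.0, 0.0), (11.0, 4.5), (48.0, 0.5),
]
labels, centroids = kmeans(stocks, k=2)
```

In practice the indicators would usually be scaled to comparable ranges first, since unscaled Euclidean distance lets the larger-valued indicator dominate the clustering.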
Title: Strategies that Guide the Availability, Information Security, and Scalability of Future Wireless Sensor Networks (WSNs)
Author(s) and Affiliation: Sapumal Darshana Salpadoru Thuppahi, Minnesota State University, Mankato and Michael Hart, Minnesota State University, Mankato
Abstract: Wireless Sensor Networks (WSNs) give industries the opportunity to manage vast numbers of sensors over various types of computer networks. New WSN research indicates several advantages for industries currently not using its associated technological advancements. To help these industries, the authors outline guidance that helps inform future WSN implementation frameworks. Using this guidance, the authors propose an iteration of a new WSN model for agriculture. The prototype addresses several needs, including high availability, information security, and scalability of wireless sensor networks using commodity hardware often present in this industry.
Title: Twitter Data Analysis about COVID-19 Vaccines using Sentiment Analysis
Author(s) and Affiliation: Maharu Chamara Wickramarathne, Minnesota State University, Mankato
Abstract: The world took tremendous measures to find a cure for COVID-19. After multiple attempts at vaccines against the virus, two vaccines were approved by the Food and Drug Administration (FDA) and the World Health Organization for distribution in the USA: the Pfizer/BioNTech COVID-19 vaccine and the Moderna COVID-19 vaccine. However, people have many questions about the vaccines (“What are the side effects?”). Answering these questions and addressing these doubts is necessary for successful vaccination of the population. This research attempts to answer these questions using Twitter data. Twitter data was analyzed by mining two thousand tweets (by vaccine-name hashtag) from the state of Minnesota for each vaccine. These tweets revealed most people’s opinions about the vaccines and how well they performed. Twitter data mining and cleaning procedures in R were used to gain better insight. The Word Cloud data visualization technique and sentiment analysis methods helped to explore those questions among the people in Minnesota.
Title: The Impact of AES Encryption on SCADA Systems for Electrical Distribution that Contain HDFS Architecture
Author(s) and Affiliation: Justin Wren and Michael Hart, Minnesota State University, Mankato
Abstract: Supervisory Control and Data Acquisition (SCADA) systems for electrical utility companies have an increasing need to provide additional insight into smart grid data. A significant contingency is the ability to design information security and big data architecture into IT infrastructure that demands minimal network latency. This study explores an IT infrastructure design for electrical generating stations that have the capability to stream encrypted internal SCADA data to a Hadoop Distributed File System (HDFS). Using the design science research methodology, the authors designed and implemented an IT critical infrastructure that uses the Advanced Encryption Standard (AES) between primary SCADA systems and intelligent electronic devices (IEDs). Results illustrate a marginal difference in network packet latency between security gateways that load balance individual relays to IEDs and single instance security gateways that handle all relays to IEDs using a LAN substation. Despite the introduction of network latency, the proposed critical IT infrastructure design decreases the amount of unencrypted data in SCADA environments and could allow streaming data securely to HDFS. Findings emphasize that carefully designing security gateways and encryption in SCADA systems is a viable and necessary step when considering streaming data from IEDs to big data environments.
Title: Blockchain in COVID-19 Vaccine Distribution
Author(s) and Affiliation: Tiati Thelen, Minnesota State University, Mankato and Rajeev Bukralia, Minnesota State University, Mankato
Abstract: Supply chain management has started utilizing blockchain technology to access information from the start of production to the consumer. Blockchains create records of consistent information. Recently, blockchain technology has been introduced into the pharmaceutical supply chain to track temperatures of vaccines from production to patient. Additionally, IoT (Internet of Things) assists blockchains by utilizing embedded sensors and software to supply blockchains with the pertinent information. This is vital because vaccines are temperature sensitive. This research provides the foundations to consider these technologies in the domain of the COVID-19 vaccine, which is unique in that many are produced in two doses. This paper contributes a systematic review of previous work and how it can effectively be advanced to the COVID-19 vaccine supply
Title: Detecting Online Review Fraud Using Sentiment Analysis
Author(s) and Affiliation: Bryn Caron, Minnesota State University, Mankato and Rajeev Bukralia, Minnesota State University, Mankato
Abstract: With the exponential increase in e-commerce, online reviews have become integral to the marketing of products and services. Customers are inclined to buy products and services that have received high ratings and positive reviews. Consequently, fake reviews are increasingly becoming a way to mislead customers into trusting, or mistrusting, the credibility and reliability of a product or service. Though online fake reviews have garnered some attention from the media and research communities, there is a need for effective technical solutions for detecting, and therefore mitigating, fraudulent reviews to improve consumer confidence in e-commerce. The purpose of this study is to explore the use of natural language processing techniques in detecting fake online reviews. We analyze the text of online reviews for various book titles. We investigate the accuracy of the polarity score, a common metric used in sentiment analysis, in the context of the star rating of the reviews. Our findings conclude that the polarity score is not a reliable measure for detecting fake reviews. In addition, the study sheds light on the limitations of sentiment analysis in detecting fake reviews.
Title: Ensemble Learning for Authorship Verification
Author(s) and Affiliation: Abdul Wahab Mohammad, Minnesota State University, Mankato and Dr. Michael Hart, Minnesota State University, Mankato
Abstract: Authorship verification is the task in which the author of a given text is identified. In this paper, the author proposes two novel methods to identify authors of text on two different benchmark datasets, namely the C50 dataset and the Gutenberg dataset. The author used BERT, a state-of-the-art NLP model, with Siamese networks, and tf-idf with attention models. The BERT model showed very good results on the training data, but it did not generalize well to the testing data. However, the model with tf-idf and an attention mechanism achieved results comparable to the state of the art on the C50 dataset. This paper also discusses how a word2vec-based preprocessing approach works in identifying authors via Siamese networks.
Title: Chatbot Knowledge Retrieval Supported by Forums
Author(s) and Affiliation: Michael A. Nyakonu, Metropolitan State University
Abstract: In this paper, we look at how a chatbot system with a dynamically growing pool of knowledge can be developed. We look at how a forum’s structure can be used as a source of infinite knowledge. The answers will be derived through web crawling. In return, we hope to demonstrate a new model that provides an infinite knowledge base to chatbot developers.
Title: Game Prediction Model(s) for the National Basketball Association
Author(s) and Affiliation: Qin Sun, Minnesota State University, Mankato and Logan Cook, Minnesota State University, Mankato
Abstract: According to Forbes statistics, there are 750 million families watching National Basketball Association (NBA) games in 212 countries. The NBA has become the most globalized and influential professional sports organization in the world. Because the league has an annual revenue of more than 4 billion U.S. dollars, predicting the outcome of NBA games is both interesting and of great commercial value. In this article, we selected the team and player data for all NBA seasons from 2004 to 2020 and, using the R language, applied thirty different data splits to obtain thirty different accuracy values for each model. Our conclusion shows that the K-Nearest Neighbor classifier has the lowest prediction accuracy among the 4 models, while the SVM classifier is the most accurate.
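Illustrative sketch (not from the paper): evaluating a model over thirty different random train/test splits gives thirty accuracy values per model, which can then be summarized or compared. The authors worked in R, so the Python below is only a stand-in. The 1-nearest-neighbor classifier, the 30% test fraction, and the synthetic "games" data are all assumptions for illustration, not the paper's setup.

```python
import random
import statistics


def nearest_neighbor_predict(train, query):
    """Return the label of the training row closest to the query features."""
    closest = min(
        train,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], query)),
    )
    return closest[1]


def accuracy_over_splits(rows, n_splits=30, test_fraction=0.3, seed=0):
    """Evaluate on n_splits random train/test partitions of the same rows."""
    rng = random.Random(seed)
    n_test = max(1, int(len(rows) * test_fraction))
    accuracies = []
    for _ in range(n_splits):
        shuffled = rows[:]
        rng.shuffle(shuffled)
        test, train = shuffled[:n_test], shuffled[n_test:]
        hits = sum(nearest_neighbor_predict(train, x) == y for x, y in test)
        accuracies.append(hits / n_test)
    return accuracies


# Synthetic rows: ((home rating, away rating), 1 if the home team won).
data_rng = random.Random(42)
games = []
for _ in range(200):
    home, away = data_rng.gauss(0, 1), data_rng.gauss(0, 1)
    games.append(((home, away), int(home + data_rng.gauss(0, 0.5) > away)))

scores = accuracy_over_splits(games)
mean_accuracy = statistics.mean(scores)
```

Running the same procedure for each candidate model yields one list of thirty accuracies per model, so models can be compared on their mean and spread and not on a single split.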
Title: Roadmap Comparison: Telehealth and NIST
Author(s) and Affiliation: Pamal Wanigasinghe, Minnesota State University, Mankato and Sarah Klammer Kruse, Minnesota State University, Mankato
Abstract: Telehealth has great potential to increase patient access to health services, decrease costs, and improve individual and public wellbeing. In order to fully realize these advantages, patients need to be assured that their health-related data will be protected, and providers must take responsibility for the security and integrity of the data gathered. Adoption and use of telehealth could be reduced or delayed if security risks are not adequately addressed. As the popularity of telehealth increases, it is important to emphasize information security for this emerging healthcare technology. A successful telehealth security plan should include all aspects of security including the underlying frameworks, policies, and education of providers and patients. This paper explores security and privacy risks of telehealth and compares the telehealth roadmaps from two organizations to the recommendations given in the Roadmap for Advancing the NIST Privacy Framework.
Title: An OS Benchmark Design to Compare SQL Load on Distributed Big Data Systems
Author(s) and Affiliation: Michael Hart, Minnesota State University, Mankato
Abstract: Although vendors publish key benchmarks of big data systems, results can differ under typical industry load and fluctuating network environments. This work develops a SQL load benchmarking process by employing the Design Science methodology. The proposed experimental process measures varied operating systems under normal business load for a popular distributed big data system. Using a modified version of the IBM-supported TPC-DS workflow, the author tests SQL completion times on three separate Apache Spark distributed clusters running Ubuntu Server, Clear Linux, and CentOS Server. Results indicate that load in real-life big data environments has a significant effect on SQL completion times.