Big Data Architecture
- Mar 13, 2024
- 3 min read
Updated: Aug 25, 2024
Purdue University Global - School of Business and Information Technology
Cybersecurity Management Graduate Student
Prof. Susan Ferebee - Security for Data Analytics
IN505 Module 1 Assessment 1: Security Issues for Data Analytics
March 13, 2024
Contents
Introduction
Architectural Vulnerabilities
Unstructured Data Vulnerabilities
ETL Vulnerabilities
Conclusion
References
Introduction
Understanding the vulnerabilities of big data is crucial. The extraction, transformation, and loading (ETL) of large volumes of unstructured data, parsed against a schema at read time, exposes workloads to transient execution vulnerabilities in the speculative, multicore CPUs that process them. These vulnerabilities are difficult to fully mitigate and have spawned attack variants that pose significant threats to the protected information and information systems defined under the Federal Information Security Management Act (FISMA) of 2002, 44 U.S.C. § 3542.
Architectural Vulnerabilities
Big data emerged from the limits of earlier systems: lengthy table scans across hundreds of terabytes, limited storage, significant data redundancy, and costly continuous network traffic. The OLAP systems of the 1980s connected multiple RDBMS solutions, but their structured archival of partitioned back-office data needed revision for widespread use; it was time for a less structured approach (Dailey, 2014). No single definition of big data exists, but four identifying factors remain: very large volumes, inexpensive storage, the Roman census method of processing, and data stored and managed in unstructured formats. Together these have delivered lower-cost, timely predictive analytical knowledge. Microarchitectural changes to CPUs made massively parallel processing of the same data possible in far less time, transforming how large amounts of data could be handled (Inmon & Linstedt, 2015). Because so much big data is unstructured, the unpredictability of poorly governed and controlled data pipelines poses significant transient execution risk and feeds the continued development of threat variants such as SPEAR, Spectre, Meltdown, Foreshadow, GhostRace, and future attack vectors (Ragab et al., 2024; Mambretti et al., 2021).
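To make the Roman census method concrete, the minimal sketch below (in C, with hypothetical partition contents, sizes, and thread counts of my own choosing, not anything from the sources above) sends the computation to each data partition on its own thread and merges only the small partial results, rather than moving the data to a central processor.

/* Minimal sketch of the "Roman census" method Inmon & Linstedt describe:
 * the processing travels to each data partition in parallel, and only
 * small partial results travel back to be merged. Partition contents are
 * hypothetical stand-ins for data that would live on separate nodes. */
#include <pthread.h>
#include <stdio.h>

#define PARTITIONS 4
#define ROWS_PER_PARTITION 5

/* Hypothetical partitioned fact data; imagine each row set on its own node. */
static long partitions[PARTITIONS][ROWS_PER_PARTITION] = {
    {3, 1, 4, 1, 5}, {9, 2, 6, 5, 3}, {5, 8, 9, 7, 9}, {3, 2, 3, 8, 4}
};

typedef struct {
    int  id;      /* which partition this worker owns */
    long partial; /* partial sum computed at the data */
} work_t;

/* Worker: the computation goes to the partition, not the reverse. */
static void *scan_partition(void *arg) {
    work_t *w = (work_t *)arg;
    w->partial = 0;
    for (int r = 0; r < ROWS_PER_PARTITION; r++)
        w->partial += partitions[w->id][r];
    return NULL;
}

int main(void) {
    pthread_t threads[PARTITIONS];
    work_t work[PARTITIONS];
    long total = 0;

    for (int i = 0; i < PARTITIONS; i++) {
        work[i].id = i;
        pthread_create(&threads[i], NULL, scan_partition, &work[i]);
    }
    for (int i = 0; i < PARTITIONS; i++) {
        pthread_join(threads[i], NULL);
        total += work[i].partial; /* merge only the small partial results */
    }
    printf("total = %ld\n", total);
    return 0;
}

The same property that makes this architecture fast is what makes it attractive to attackers: many threads racing over shared data on speculative, out-of-order cores is precisely the environment in which the transient execution variants above operate.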
Unstructured Data Vulnerabilities
The unstructured nature of big data presents a significant challenge to data analytics security. Repetitive, consistent data is less of a threat; non-repetitive unstructured data that carries valuable business information poses the greater challenge. Building a dependent, logical pipeline to data marts with parallel processes, as opposed to feeding an independent data mart from an operational data store, is complex. Even after the schema is applied and parsing is complete, the lack of structure leaves many textual ambiguities, possible privacy violations, and a persistent need for data quality feedback and threat analysis, as sketched below.
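As a minimal sketch of that feedback loop, the following C fragment parses a hypothetical name=value record format of my own invention: fields that fail the schema or bounds checks are rejected, and textual fields that cannot be disambiguated are routed to quality review rather than loaded silently.

/* Minimal sketch of schema-on-read over non-repetitive text, assuming a
 * hypothetical "name=value;" record format. Fields that do not match the
 * expected schema are flagged for quality/threat review instead of being
 * loaded silently. */
#include <stdio.h>
#include <string.h>
#include <ctype.h>

#define MAX_FIELD 32

/* Returns 1 if the value is all digits (unambiguously numeric), else 0. */
static int is_numeric(const char *v) {
    if (!*v) return 0;
    for (; *v; v++)
        if (!isdigit((unsigned char)*v)) return 0;
    return 1;
}

int main(void) {
    /* Hypothetical unstructured record: one field is textual and ambiguous. */
    const char *record = "acct=104;amount=250;note=wire xfer urgent";
    char buf[128];
    strncpy(buf, record, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';

    for (char *tok = strtok(buf, ";"); tok; tok = strtok(NULL, ";")) {
        char *eq = strchr(tok, '=');
        if (!eq || (size_t)(eq - tok) >= MAX_FIELD) { /* bounds check */
            printf("REJECT (malformed): %s\n", tok);
            continue;
        }
        *eq = '\0';
        const char *name = tok, *value = eq + 1;
        if (is_numeric(value))
            printf("LOAD   %s = %s\n", name, value);
        else
            printf("REVIEW %s = \"%s\" (textual; route to quality feedback)\n",
                   name, value);
    }
    return 0;
}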
ETL Vulnerabilities
Attack classification trees are used to categorize the hardware logic in processor pipelines that has stalled, performed operations and instructions earlier than it should, or executed the extractions, transformations, and loads of a pipeline out of order. ETL therefore creates a need for governed, controlled, and predictable instruction streams, because transient execution leaves traces in cache state below the architectural level, where software defenses have limited reach. Covert channels that exploit control- or data-flow mispredictions, faulting instructions, or out-of-order executions, bypassing bounds checks, function call/return abstractions, and memory stores, or computing unauthorized results from faulting instructions, are just a few of the ways big data parallel-processing exploits sidestep hardware-enforced security policies (Canella et al., 2019).
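The canonical instance of a bounds-check bypass in that classification is the Spectre-v1 gadget. The sketch below is illustrative, with hypothetical array names and sizes: the branch is architecturally safe, but a mispredicting core may still read out of bounds during speculation and encode the secret byte into cache state, a covert channel a timing probe can later recover. A masked variant of the kind used in practice (for example, Linux's array_index_nospec()) follows.

/* Illustrative Spectre-v1 (bounds check bypass) gadget of the kind the
 * Canella et al. classification covers; array names and sizes here are
 * hypothetical. Under misprediction, array1[x] is read out of bounds and
 * its value is encoded into cache state via array2, forming a covert
 * channel that a later timing probe can recover. */
#include <stddef.h>
#include <stdint.h>

uint8_t array1[16];
uint8_t array2[256 * 512];
size_t  array1_size = 16;

uint8_t victim(size_t x) {
    if (x < array1_size)                /* architecturally safe...       */
        return array2[array1[x] * 512]; /* ...but speculatively leaky    */
    return 0;
}

/* One common software mitigation: clamp the index so that even a
 * mispredicted branch cannot form an out-of-bounds address. The
 * power-of-two mask below is the simple case; production kernels use
 * branchless masking such as Linux's array_index_nospec(). */
uint8_t victim_masked(size_t x) {
    x &= array1_size - 1;               /* clamp index before the access */
    return array2[array1[x] * 512];
}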
Conclusion
Continuous monitoring of and response to threats is essential. Applying the C.I.A. triad and Parkerian hexad security models to vulnerabilities such as these, through physical, logical, and administrative controls, can help prevent the interception, interruption, modification, and fabrication of information (Andress, 2019). Identifying what is at risk, the threats and vulnerabilities involved, the assessed risk, and the appropriate mitigations, along with the impact an attack would have on those assets, is a continuous process. Incident response exercises are a valuable preparation and training tool: rehearsing how to prepare for, detect, analyze, contain, eradicate, recover from, and review attack scenarios shows where the depth of your defensive layers can improve.
References
Inmon, W.H., & Linstedt, D. (2015). Data architecture: A primer for the data scientist: Big data, data warehouse, and data vault. Morgan Kaufmann.
Canella, C., Van Bulck, J., Schwarz, M., Lipp, M., von Berg, B., Ortner, P., Piessens, F., Evtyushkin, D., & Gruss, D. (2019). A systematic evaluation of transient execution attacks and defenses. 28th USENIX Security Symposium, 240–266. https://css.csail.mit.edu/6.858/2024/readings/transient-execution.pdf
Andress, J. (2019). Foundations of information security: A straightforward introduction. No Starch Press.
Federal Information Security Management Act of 2002, 44 U.S.C. § 3542.
Ragab, H., Mambretti, A., Kurmus, A., & Giuffrida, C. (2024, March 12). GhostRace: Exploiting and mitigating speculative race conditions. 33rd USENIX Security Symposium (USENIX Security 24). https://www.vusec.net/projects/ghostrace
Mambretti, A., Sandulescu, A., Sorniotti, A., Robertson, W., Kirda, E., & Kurmus, A. (2021). Bypassing memory safety mechanisms through speculative control flow hijacks. IEEE European Symposium on Security and Privacy (EuroS&P), 633–649. https://doi.org/10.1109/EuroSP51992.2021.00048
Sampson, A. (2016). Teradata basics: Data marts [Video]. Skillsoft.
Dailey, W. (2014). Big data: Big data and data warehouses [Video]. Skillsoft. https://libauth.purdueglobal.edu/sso/skillport?context=72256