Purdue University Global School of Business and Information Technology
Master of Science in Cybersecurity Management
IN 505 Module 4: Protecting Big Data
Prof. Susan Ferebee
August 13th, 2024
Abstract
Data privacy raises key concerns about how data is collected and protected. Consent to use should be informed and scoped to specific user roles and attributes. Data sharing with third parties or data brokers should be fully disclosed to users. Security breaches caused by hacking, data leaks, or weak security measures should be mitigated, and true-positive risks should be a transparent part of the user experience that users can report on actively and safely. User profiling and government surveillance of behavior should be treated as having a considerable privacy impact on users. Ethical concerns such as doxing, harassment, manipulation, and misinformation should be mitigated, with an action plan in place to keep them from becoming part of the user experience. Privacy settings should not be complex or default to public; new accounts should start with least privilege and data-minimizing options enabled. Compliance with data privacy regulations should be balanced against the usability of the social media monitoring and analytics data strategy, while ensuring that data minimization, anonymization, pseudonymization, user consent, user transparency, data security, regular audits, compliance checks, user controls, user preferences, and ethical considerations are in place for the data being sourced. Big data use cases for social media monitoring and analytics require several data privacy techniques and methods to ensure compliance with regulations such as GDPR and CCPA and to protect user information. Trustworthy, transparent services that build user confidence in privacy and regulatory compliance are a primary concern in big data, analytics, and machine learning environments.
Keywords: Data, Data Analytics, Privacy Threats, Privacy Preservation
Introduction
Data sources for social media monitoring and analysis come from multiple social media platforms and can be used to understand trends, reactions, and the effects of different data strategies online. Key data types used for monitoring and analytics include engagement metrics, reach, impressions, audience demographics, sentiment analysis, traffic data, competitor analysis, content performance, and influencer impact. Gathering these data types involves sourcing, governing, and controlling the security and privacy of the different big data source collections as they flow into reports, insights, and application programming interfaces (APIs) (Rahman & Reza, 2021). From the structured data of engagement metrics and demographics to the unstructured data in text posts, images, and videos, there are repetitive weekly and daily metrics, scheduled posts, non-repetitive viral content, and even crisis events to graph, monitor, and analyze. These data types and attributes can be combined to understand sentiment and influence for a specific role or audience (Sutherland, 2020).
Data being actively processed, accessed, or manipulated by a computer system is “in use.” Real-time operations rely on this critical data, which resides non-persistently in RAM, CPU caches, or CPU registers. The challenge for these operations and applications is securing sensitive information while it is being processed (Müller & vom Berge, 2022). Encryption keys, user input, database queries, streaming data, and real-time analytics should remain protected and secured. However, data volume, latency, and integrations increase the complexity, the time to process, and the time to combine information from multiple systems, and each can leave data vulnerable. Competitive and responsive business models may require real-time analytics to remain profitable and meet consumer demand. Still, they must do so while securing the data and maintaining the privacy (Cerruto et al., 2022) of individual identities, health information, payment information, and intellectual property while the data is in use.
Key Techniques
Assessing the security and privacy of anonymized data sets in social media monitoring and analytical processes requires quantifying how much personally identifiable information (PII), protected health information (PHI), payment card information (PCI), or intellectual property (IP) has been removed, so that the anonymity of the data can be measured. There are many popular algorithms for anonymization. Three main anonymization techniques are K-anonymity, L-diversity, and T-closeness. K-anonymity requires that each record be indistinguishable from at least k-1 other records on its quasi-identifying attributes, so that every individual is hidden in a group of at least (k). L-diversity goes beyond K-anonymity, requiring a minimum of (l) well-represented values for the sensitive attributes in each group. T-closeness requires that the distribution of a sensitive attribute within each equivalence class stay within a threshold (t) of its distribution across the entire data set. Each technique should be assessed, quantified, and measured to rate its effect on the utility of data sets once applied (Tomás et al., 2022). Open-source software like ARX Data Anonymization Tool, Amnesia, and Presidio can help manage and protect sensitive data (Monteiro et al., 2023).
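As a minimal sketch of how K-anonymity and L-diversity might be measured, the Python below groups records by their quasi-identifiers and reports the smallest group size (k) and the smallest number of distinct sensitive values in any group (l, in its simplest "distinct" form). The field names and sample records are hypothetical and are not taken from the cited tools.

```python
from collections import defaultdict

# Hypothetical, already-generalized sample records (illustration only).
SAMPLE_RECORDS = [
    {"age_band": "20-29", "region": "Midwest", "topic": "health"},
    {"age_band": "20-29", "region": "Midwest", "topic": "finance"},
    {"age_band": "30-39", "region": "South", "topic": "health"},
    {"age_band": "30-39", "region": "South", "topic": "health"},
    {"age_band": "30-39", "region": "South", "topic": "health"},
]


def equivalence_classes(records, quasi_identifiers):
    """Group records that share the same quasi-identifier values."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[q] for q in quasi_identifiers)
        groups[key].append(record)
    return groups


def k_anonymity(records, quasi_identifiers):
    """Return k: the size of the smallest equivalence class."""
    groups = equivalence_classes(records, quasi_identifiers)
    return min(len(group) for group in groups.values())


def l_diversity(records, quasi_identifiers, sensitive):
    """Return l: the fewest distinct sensitive values found in any class."""
    groups = equivalence_classes(records, quasi_identifiers)
    return min(len({r[sensitive] for r in group}) for group in groups.values())


if __name__ == "__main__":
    qi = ["age_band", "region"]
    print("k =", k_anonymity(SAMPLE_RECORDS, qi))
    print("l =", l_diversity(SAMPLE_RECORDS, qi, "topic"))
```

In this sample every group has at least two members, yet one group shares a single sensitive value, which is the kind of gap L-diversity is meant to expose.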
When maintaining the utility of the data is paramount, assess replacing private identifiers with pseudonyms or fake identifiers to protect PII, PHI, PCI, and IP. How identifiers map to keys (composite or semantic keys) should be assessed, quantified, and measured against organizational standards and regulatory requirements so that the approach de-identifies individual data, provides security, maintains data utility, and facilitates cross-organizational data sharing (Khan, 2018). Medical coding for ICD-10 and data conversions to ICD-11 are two examples, and this method is best applied when aggregated anonymous data is needed for analysis (Cardinal et al., 2024).
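One common way to generate pseudonyms, shown here only as an illustrative sketch, is a keyed hash (HMAC-SHA256) of the identifier. The same input always maps to the same pseudonym, so records can still be joined, while reversing the mapping requires the secret key. The function name, the "user_" prefix, and the 16-character truncation are assumptions made for this example.

```python
import hashlib
import hmac


def pseudonymize(identifier: str, secret_key: bytes) -> str:
    """Replace an identifier with a stable keyed pseudonym.

    The secret key must be stored separately from the data set;
    anyone holding it can recompute the mapping.
    """
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    # Truncated for readability in this sketch; keep more characters in
    # practice to lower the chance of two identifiers colliding.
    return "user_" + digest.hexdigest()[:16]


if __name__ == "__main__":
    key = b"example-key-held-by-the-data-owner"  # hypothetical key
    print(pseudonymize("alice@example.com", key))
    print(pseudonymize("bob@example.com", key))
```

Because the pseudonym is deterministic under one key, rotating the key breaks linkability between old and new releases, which may or may not be desirable for a given data-sharing agreement.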
When the output must limit disclosure of PII, PHI, PCI, or IP, differential privacy can be used to add random noise to data or query results. Because including or excluding a single data point may affect analysis outcomes, assessments of that effect may be required to compare results between data sets that differ by one record. Those analyzing the data should understand why the noise is added, and the amount of privacy protection applied should be set as appropriate for a particular permissible use or purpose (Javed et al., 2023).
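The noise-adding idea can be sketched with the Laplace mechanism applied to a counting query. A count changes by at most 1 when one record is added or removed, so noise drawn from a Laplace distribution with scale 1/epsilon is used. This is an illustrative sketch rather than a production mechanism; the function names and sample ages are hypothetical.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_count(values, predicate, epsilon: float, rng: random.Random) -> float:
    """Return a noisy count of the values that satisfy the predicate.

    A counting query has sensitivity 1, so the noise scale is 1 / epsilon.
    A smaller epsilon means more noise and stronger privacy.
    """
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random()
    ages = [23, 35, 41, 29, 52, 38]  # hypothetical audience ages
    print(private_count(ages, lambda a: a > 30, epsilon=1.0, rng=rng))
```

Each released answer consumes part of the privacy budget, so repeated queries against the same data need their epsilon values tracked and summed.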
Some techniques obscure data within a database through static data masking (where sensitive data is always masked) or dynamic data masking (where masking is performed in real time based on authorization). Static masking provides a more permanent mask for data used outside the production environment. Generalization lowers the granularity at which data is represented, and suppression protects privacy by removing access to specific data. Knowing what data should be governed through data masking, and verifying that the process is followed and documented, should be part of the security and privacy assessment for all social media monitoring and analytics data (Chia, 2024). On-the-fly data masking during transmission should be part of any environment where big data is high-velocity. Hybrid solutions like format-preserving encryption are vital to compliance and operational continuity requirements. Blockchain is a popular technology for permanently masking database ledger entries (Javed et al., 2023), and it can be combined with homomorphic encryption for federated learning (FL) models based on local differential privacy (LDP) in a zero-trust environment.
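A small sketch can show dynamic masking, generalization, and suppression together. The role names, record fields, and masking rules below are hypothetical choices for illustration and do not describe any particular product.

```python
def mask_email(email: str) -> str:
    """Keep the first character of the local part and mask the rest."""
    local, _, domain = email.partition("@")
    return local[:1] + "*" * max(len(local) - 1, 0) + "@" + domain


def generalize_age(age: int, width: int = 10) -> str:
    """Generalization: report an age band instead of the exact age."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def view_record(record: dict, role: str) -> dict:
    """Dynamic masking: what a caller sees depends on their role."""
    if role == "privacy_officer":  # hypothetical fully authorized role
        return dict(record)
    return {
        "email": mask_email(record["email"]),
        "age": generalize_age(record["age"]),
        "post_text": record["post_text"],
        # Suppression: the phone number is omitted entirely.
    }


if __name__ == "__main__":
    record = {"email": "alice@example.com", "age": 34,
              "phone": "555-0100", "post_text": "Great service!"}
    print(view_record(record, "analyst"))
    print(view_record(record, "privacy_officer"))
```

Static masking would apply the same transformations once and store only the masked copy, which is why it suits data exported to test or analytics environments.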
Encryption algorithms should be chosen for the specific context, weighing data sensitivity, data volume, compliance standards, performance requirements (processing power and latency), key length, key rotation, secure key storage (FPIC, 2020), proven up-to-date algorithms, multiple encryption layers, and other security measures (Stouffer, 2023). Symmetric encryption (a single key to encrypt and decrypt) involves algorithms like the Advanced Encryption Standard (AES), Triple Data Encryption Standard (3DES), and Blowfish (a fast and efficient block cipher). It is best used for data storage encryption. Asymmetric encryption uses key pairs, a public key for encryption and a private key for decryption, in algorithms like Rivest-Shamir-Adleman (RSA) and Elliptic Curve Cryptography (ECC). It is best used for digital signing and key exchanges in transit (He, 2023). Hashing with Message Digest Algorithm 5 (MD5) or Secure Hash Algorithm 256-bit (SHA-256) can be used for data integrity and masking to confirm that data was not altered in transit; because MD5 is no longer collision-resistant, SHA-256 is the safer choice. TLS/SSL is a hybrid scheme that combines symmetric and asymmetric encryption and is used by web application services to protect data in transit. End-to-end encryption (like that offered by WhatsApp and Signal), in which a message is encrypted on the sender's device and can only be decrypted on the receiver's device, should also be used to protect data in transit. Regulations like the Health Insurance Portability and Accountability Act (HIPAA), the Family Educational Rights and Privacy Act (FERPA), and the Fair Credit Reporting Act (FCRA) require sensitive and protected data to be safeguarded, and encryption is a common way to meet those requirements. For in-use data, homomorphic encryption allows computation on encrypted data without decrypting it, which is helpful for cloud processing that requires secure data analysis. Quantum encryption is still experimental, but in principle it uses quantum mechanics to secure data against quantum computing threats.
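The integrity use of hashing can be sketched with Python's standard library: the sender publishes a SHA-256 digest, and the receiver recomputes it to detect alteration. The function names and payload are hypothetical. A bare digest only helps when it reaches the receiver over a channel the attacker cannot modify; otherwise a keyed MAC or a digital signature is the usual choice.

```python
import hashlib
import hmac


def sha256_digest(payload: bytes) -> str:
    """Return the SHA-256 digest of the payload as a hex string."""
    return hashlib.sha256(payload).hexdigest()


def verify_integrity(payload: bytes, expected_digest: str) -> bool:
    """Recompute the digest and compare it in constant time."""
    return hmac.compare_digest(sha256_digest(payload), expected_digest)


if __name__ == "__main__":
    export = b'{"post_id": 1, "sentiment": "positive"}'  # hypothetical payload
    digest = sha256_digest(export)
    print(verify_integrity(export, digest))         # unmodified payload
    print(verify_integrity(export + b" ", digest))  # altered payload
```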
Network IDS and IPS monitoring and Syslog scans should monitor and enforce encryption to ensure data remains protected while stored, in transit, and in use (Awotunde et al., 2023). Packet sniffing, Syslog monitoring, vulnerability scans, and penetration testing of network devices should actively test for unencrypted traffic and possible privacy violations and alert network operations and security staff when they are found. AES should be used to encrypt and store large volumes of data quickly. Key exchanges and digital signatures should utilize RSA encryption. Data in use should utilize homomorphic encryption, of which there are three types. Partially Homomorphic Encryption (PHE) supports only one operation (unpadded RSA supports multiplication, and Paillier supports addition). Somewhat Homomorphic Encryption (SHE) supports a specific set of operations, but only a limited number of times. Fully Homomorphic Encryption (FHE) allows both addition and multiplication on arbitrary data without restrictions, as in Gentry’s scheme. Homomorphic encryption benefits data privacy, secure data outsourcing in untrusted environments, regulatory compliance, and enhanced security.
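The two partially homomorphic properties mentioned above can be demonstrated with toy parameters. The sketch below uses textbook (unpadded) RSA and a minimal Paillier implementation with very small primes chosen only for illustration; keys of this size offer no security, and real deployments rely on vetted libraries. It assumes Python 3.8 or later for the modular inverse via pow.

```python
import math
import random

# --- Toy textbook RSA (unpadded): multiplicative homomorphism ---
RSA_N, RSA_E, RSA_D = 3233, 17, 2753  # n = 61 * 53, illustration only


def rsa_encrypt(m: int) -> int:
    return pow(m, RSA_E, RSA_N)


def rsa_decrypt(c: int) -> int:
    return pow(c, RSA_D, RSA_N)


# --- Toy Paillier: additive homomorphism ---
P, Q = 293, 433                      # tiny primes, illustration only
PAILLIER_N = P * Q
PAILLIER_N2 = PAILLIER_N * PAILLIER_N
G = PAILLIER_N + 1                   # standard simple choice of generator
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # lcm(p-1, q-1)
MU = pow(LAM, -1, PAILLIER_N)        # modular inverse of lambda mod n


def paillier_encrypt(m: int, rng: random.Random) -> int:
    """Encrypt m (0 <= m < n) with fresh randomness r coprime to n."""
    while True:
        r = rng.randrange(1, PAILLIER_N)
        if math.gcd(r, PAILLIER_N) == 1:
            break
    return (pow(G, m, PAILLIER_N2) * pow(r, PAILLIER_N, PAILLIER_N2)) % PAILLIER_N2


def paillier_decrypt(c: int) -> int:
    x = pow(c, LAM, PAILLIER_N2)
    return ((x - 1) // PAILLIER_N * MU) % PAILLIER_N


if __name__ == "__main__":
    rng = random.Random()
    # Multiplying RSA ciphertexts multiplies the plaintexts (mod n).
    product = (rsa_encrypt(7) * rsa_encrypt(6)) % RSA_N
    print(rsa_decrypt(product))
    # Multiplying Paillier ciphertexts adds the plaintexts (mod n).
    total = (paillier_encrypt(1200, rng) * paillier_encrypt(345, rng)) % PAILLIER_N2
    print(paillier_decrypt(total))
```

An untrusted analytics service could, for example, sum encrypted engagement counts this way and return a single ciphertext that only the key holder can decrypt.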
User roles and permissions that restrict access to data, along with access control lists, provide another way to protect sensitive data from unauthorized personnel. Role-based and attribute-based access controls (RBAC and ABAC) can be set up for many roles and attributes across data sets. A thorough data analysis to determine access based on roles and attributes should be considered when reviewing data governance policies and assessing permission controls. Spatial big data has become essential in determining access restrictions based on zero-trust locations and networks (Ram Mohan Rao et al., 2018).
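A minimal sketch of how role-based and attribute-based checks can be layered is shown below. The roles, actions, and attributes (such as on_trusted_network) are hypothetical and stand in for whatever an organization's governance policy defines.

```python
# RBAC: each role is granted a fixed set of actions (hypothetical policy).
ROLE_PERMISSIONS = {
    "analyst": {"read_aggregates"},
    "privacy_officer": {"read_aggregates", "read_raw_posts"},
}


def is_allowed(user: dict, action: str, resource: dict) -> bool:
    """Allow an action only if both the role and the attributes permit it."""
    # RBAC check: the role must include the requested action.
    if action not in ROLE_PERMISSIONS.get(user["role"], set()):
        return False
    # ABAC check: highly sensitive resources require a trusted network.
    if resource["sensitivity"] == "high" and not user["on_trusted_network"]:
        return False
    return True


if __name__ == "__main__":
    officer = {"role": "privacy_officer", "on_trusted_network": False}
    raw_posts = {"name": "raw_posts", "sensitivity": "high"}
    print(is_allowed(officer, "read_raw_posts", raw_posts))
```

Unknown roles receive an empty permission set, so the default is to deny, which matches a least-privilege posture.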
Data collection should also be minimized to only the data necessary for a specific purpose, with notification during collection via privacy notices, fair processing statements, or privacy policy updates. Data minimization reduces the risk from a data breach and supports compliance with privacy regulations. Data governance and controls should include this requirement for minimal exposure, limited to explicitly stated permissible purposes. An oversight committee should perform timely acknowledgments and reviews of any questionable or unnecessary data to clarify whether data collection is adequate, relevant, and necessary relative to the purpose and process (Khan, 2021). Checkboxes are a useful tool to minimize freeform text. AI and machine learning automation are useful tools to prevent data outside permissible purposes from being ingested into systems across the enterprise and to foster a needs-based retention policy.
In general, the best practice is to publish data in a form that protects individual privacy, known as privacy-preserving data publishing. The published data should have reduced granularity and should omit specific entries that could be used to identify individuals or trade secrets. Using different data sets, assessments can be done to determine how difficult it is to identify individuals and trade secrets (Dutkiewicz et al., 2022).
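Purpose-based minimization can be sketched as an allow-list applied at ingestion, so that fields not needed for the stated purpose are never stored. The purposes and field names below are hypothetical examples.

```python
# Hypothetical allow-list: the only fields each stated purpose may retain.
ALLOWED_FIELDS = {
    "sentiment_analysis": {"post_text", "posted_at", "platform"},
    "reach_reporting": {"impressions", "posted_at", "platform"},
}


def minimize(record: dict, purpose: str) -> dict:
    """Keep only the fields permitted for the stated purpose."""
    if purpose not in ALLOWED_FIELDS:
        # No documented purpose means nothing may be collected.
        raise ValueError(f"no permissible purpose registered: {purpose!r}")
    allowed = ALLOWED_FIELDS[purpose]
    return {field: value for field, value in record.items() if field in allowed}


if __name__ == "__main__":
    raw = {
        "post_text": "Loving the new release",
        "posted_at": "2024-08-01T12:00:00Z",
        "platform": "example-platform",
        "email": "alice@example.com",
        "impressions": 1200,
    }
    print(minimize(raw, "sentiment_analysis"))
```

Rejecting unregistered purposes outright gives an oversight committee a single place to review and approve what each pipeline is allowed to retain.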
Randomization, which slightly alters data to prevent exact identification, is another valuable technique for analysis that does not require data to be exactly as entered. Making the data more difficult to parse and join can make personally identifiable information harder to retrieve from multiple data sets. Randomization takes several forms: differential privacy, randomized survey responses, perturbation that adds random noise to trend and pattern data collections while maintaining the data patterns, and random hashing to prevent reverse-engineering of data. Random techniques help preserve both the utility and the privacy of social media monitoring data and analytics (Srivastava & Singh, 2021).
Privacy impact assessments (PIAs) should be done as mandated to evaluate privacy impacts during data processing activities. This practice can mitigate privacy risks before implementing new data processing jobs that may expose PII or PHI. Implementing these techniques effectively balances the need for data usability with the imperative to protect user privacy. This approach ensures compliance with data protection regulations and fosters user trust (Kuroda et al., 2024). Audits, compliance checks, user transparency, and ethical considerations are also important, especially in cases that process medical data, provide financial services, or facilitate secure data processing in cloud environments (Kazaure et al., 2024). When homomorphic encryption becomes computationally intensive, performance overhead can be a significant challenge. This can result in longer processing times and require specialized knowledge workers to implement and manage data processing schemes (Jiang et al., 2018).
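Randomized survey responses can be sketched as follows: each respondent answers truthfully with probability p_truth and otherwise reports a coin flip, so no single answer reveals the truth, yet the overall rate can still be estimated. The function names and the simulated population are assumptions made for this illustration.

```python
import random


def randomized_answer(truth: bool, p_truth: float, rng: random.Random) -> bool:
    """Report the truth with probability p_truth, else a fair coin flip."""
    if rng.random() < p_truth:
        return truth
    return rng.random() < 0.5


def estimate_true_rate(answers, p_truth: float) -> float:
    """Invert the randomization to estimate the true 'yes' rate.

    Expected observed rate = p_truth * true_rate + (1 - p_truth) * 0.5,
    so the true rate is recovered by solving for it.
    """
    observed = sum(answers) / len(answers)
    return (observed - (1 - p_truth) * 0.5) / p_truth


if __name__ == "__main__":
    rng = random.Random()
    # Simulated population in which 30% would truthfully answer "yes".
    truths = [i < 3000 for i in range(10000)]
    answers = [randomized_answer(t, 0.5, rng) for t in truths]
    print(round(estimate_true_rate(answers, 0.5), 3))
```

The estimate becomes noisier as p_truth falls, which is the utility cost paid for giving each respondent stronger deniability.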
Conclusion
The techniques described here are powerful tools for enhancing data security and privacy when data is used by third parties or in untrusted domains. Homomorphic encryption in particular can keep the entire process private by performing computations on encrypted data without decrypting it. When analyzing issues related to outsourced storage and processing of big data, these techniques allow one to measure and evaluate the benefits and challenges of each approach when incorporating data, inferences, and reasoning to solve analytical problems while complying with privacy regulations in published data sets. Protected data for social media monitoring and analysis will be primarily unstructured, non-repetitive data, and leadership will be well advised to analyze the types of data present in each social media post and platform. While some data may be structured, it is also related to unstructured data, and while some posts may be repetitive, most will be non-repetitive. This analysis should help determine what data privacy issues exist within the monitoring and analytical data for each data source. There is a balancing act between data privacy compliance and the usability of social media monitoring and analytics data for specific purposes. When evaluating the available encryption techniques, recommend the methods that would most safely allow the use of big data for the stated purpose while protecting the privacy of the data.
References
Awotunde, J., Gaber, T., Prasad, N., Folorunso, S., & Lalitha, V. (2023). Privacy and Security Enhancement of Smart Cities Using Hybrid Deep Learning-Enabled Blockchain. Scalable Computing: Practice & Experience, 24(3), 561–584. https://doi.org/10.12694/scpe.v24i3.2272
Cerruto, F., Cirillo, S., Desiato, D., Gambardella, S.M., & Polese, G. (2022 Feb 14). Social network data analysis to highlight privacy threats in sharing data. Journal of Big Data, 9(1), 1-26. https://doi.org/10.1186/s40537-022-00566-7
Chia, A. (2024 Mar 22). What is Data Masking? Splunk Blogs. Retrieved 2024 Aug 18 from https://www.splunk.com/en_us/blog/learn/data-masking.html
Dutkiewicz, L., Miadzvetskaya, Y., Ofe, H., Barnett, A., Helminger, L., Lindstaedt, S., & Tugler, A. (2022 Sept 9). Privacy-Preserving Techniques for Trustworthy Data Sharing: Opportunities and Challenges for Future Research. Springer, Cham. 319–335. https://doi.org/10.1007/978-3-030-98636-0_15
Federal Partnership for Interoperable Communications (FPIC). (2020). Operational best practices for encryption key management. Cybersecurity and Infrastructure Security Agency. Retrieved on 2024 Aug 18 from https://www.cisa.gov/sites/default/files/publications/08-19-2020_Operational-Best-Practices-for-Encryption-Key-Mgmt_508c.pdf
He, G. (2023). Distributed Intelligent Model for Privacy and Secrecy in Preschool Education. Applied Artificial Intelligence, 37(1), 1–23. https://doi.org/10.1080/08839514.2023.2222494
Jiang, R., Lu, R., & Raymond Choo, K.K. (2018 Jan). Achieving high performance and privacy-preserving query over encrypted multidimensional big metering data. Future Generation Computer Systems, 78(1), 392-401. https://doi.org/10.1016/j.future.2016.05.005
Khan, M. J. (2018 Jan 1). Big Data Deidentification, Reidentification and Anonymization. ISACA Journal. 2018(1). https://www.isaca.org/resources/isaca-journal/issues/2018/volume-1/big-data-deidentification-reidentification-and-anonymization
Khan, M. J. (2021 Mar 29). Data Minimization – A Practical Approach. ISACA Industry News. Retrieved 2024 Aug 18 from https://www.isaca.org/resources/news-and-trends/industry-news/2021/data-minimization-a-practical-approach
Kazaure, A. A., Yusoff, M. N., & Jantan, A. (2024). Digital Forensic Investigation on Social Media Platforms: A Survey on Emerging Machine Learning Approaches. Journal of Information Science Theory & Practice (JISTaP), 12(1), 39–59. https://doi.org/10.1633/JISTaP.2024.12.1.3
Monteiro, S., Oliveira, D., Antonio, J., Sa, F., Wanzeller, C., & Martins, P. (2023 Sept 5). Data Anonymization: Techniques and Models. ICMarkTech: Marketing and Smart Technologies (SIST), 344, 73-84. https://doi.org/10.1007/978-981-99-0333-7_6
Müller, D. & vom Berge, P. (2022). Institute for Employment Research, Germany: International Access to Labor Market Data. In: Cole, Dhaliwal, Sautmann, and Vilhuber (eds), Handbook on Using Administrative Data for Research and Evidence-based Policy, v1.1. https://admindatahandbook.mit.edu/book/v1.1/iab.html
Rahman, M.S., & Reza, H. (2021 Jun 5). Big Data Analytics in Social Media: A Triple T (Types, Techniques, and Taxonomy) Study. In: Latifi, S. (eds) ITNG 2021 18th International Conference on Information Technology-New Generations. Advances in Intelligent Systems and Computing. 1346, 479-487. https://doi.org/10.1007/978-3-030-70416-2_62
Ram Mohan Rao, P., Murali Krishna, S., & Siva Kumar, A.P. (2018 Sept 22). Privacy preservation techniques in big data analytics: a survey. Journal of Big Data, 5(33). https://doi.org/10.1186/s40537-018-0141-8
Rouse, M. (2022 Oct 6). Data in Use. Techopedia Dictionary: Data Management. Retrieved 2024 Aug 18 from https://www.techopedia.com/definition/29515/data-in-use
Sobb, T., Turnbull, B., & Moustafa, N. (2023). A Holistic Review of Cyber–Physical–Social Systems: New Directions and Opportunities. Sensors (14248220), 23(17), 7391. https://doi.org/10.3390/s23177391
Srivastava, S. & Singh, Y.N. (2021 Nov 11). Social Media Big Data Analytics: Security Vulnerabilities and Defenses. Springer, Singapore: Machine Vision and Augmented Intelligence—Theory and Applications. Lecture Notes in Electrical Engineering. 796, 245-253. https://doi.org/10.1007/978-981-16-5078-9_21
Stouffer, C. (2023 Jul 18). What is encryption? How it works + types of encryption. Norton Blog: Privacy. Retrieved on 2024 Aug 18 from https://us.norton.com/blog/privacy/what-is-encryption
Sutherland, K.E. (2020 Dec 22). Social Media Monitoring, Measurement, Analysis and Big Data. In: Strategic Social Media Management. Palgrave Macmillan, Singapore. 7, 133-172. https://doi.org/10.1007/978-981-15-4658-7_7
Tomás, J., Rasteiro, D., & Bernardino, J. (2022). Data Anonymization: An Experimental Evaluation Using Open-Source Tools. Future Internet, 14(6), 167. https://doi.org/10.3390/fi14060167