#tdwi
officialtele2n · 1 year
TELE2N PROGRAMMING LIST
CURRENT:
Back to Detention (unofficial Detentionaire sequel)
Disventure Camp S3
Genetix.TV
Total Dangan Island: Total Happy Havoc
Total Drama Warrungai Island
UPCOMING:
Total Dangan Island: Hello Drama
Total Dangan Island: Voting Harmony
satvikasailu6 · 1 month
The Importance of Data Quality in AI Projects: Key Practices for Success
Data quality is the backbone of any successful AI project. High-quality data ensures that AI models are accurate, reliable, and unbiased, which is crucial for making informed decisions and achieving desired outcomes. On the flip side, poor data quality can lead to incorrect predictions, flawed insights, and costly mistakes. In fact, Gartner estimates that poor data quality costs organizations an average of $15 million annually, primarily due to inefficiencies and missed opportunities. The stakes are even higher in AI, where inaccurate data can result in significant financial losses and reputational damage.
A McKinsey report underscores that continuous data health monitoring and a data-centric approach are essential for unlocking AI’s full potential. This highlights the necessity of ongoing data quality management. Maintaining high data quality is not just a best practice—it's a critical requirement for the success and sustainability of AI projects.
Understanding Data Quality in AI
Data quality refers to how accurate, complete, reliable, and relevant a dataset is for its intended use. In AI, high-quality data directly impacts the performance and accuracy of models.
Common Data Quality Issues in AI Projects
AI projects often face issues such as data inconsistency, incomplete datasets, and data bias. For instance, Zillow's home-buying algorithm failed due to outdated and inconsistent data, leading to overpayments and significant financial losses. This case illustrates the critical need for up-to-date and accurate data in AI models to avoid costly errors.
Similarly, a mining company developing a predictive model for its mill processes faced challenges due to data being analyzed only once before storage. This lack of continuous monitoring resulted in unreliable predictions. By implementing real-time data health monitoring, the company improved its data quality and prediction accuracy.
Best Practices for Ensuring Data Quality in AI
1. Implement Data Governance Frameworks
A robust data governance framework establishes policies, procedures, and standards for data management, ensuring consistency and accountability. Key components include data stewardship, quality metrics, and lifecycle management. According to IDC, organizations with strong data governance frameworks see a 20% improvement in data quality.
2. Data Profiling and Cleansing
Data profiling examines data to understand its structure and quality, while data cleansing corrects inaccuracies. Effective profiling and cleansing can significantly enhance data quality. For instance, a financial institution reduced data errors by 30% through these practices.
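As a rough illustration of this practice, here is a short pandas sketch that profiles a table and applies a few basic cleansing steps. The file name and column names (customers.csv, email, signup_date) are hypothetical.

```python
# A minimal profiling-and-cleansing sketch using pandas (assumed available).
# The input file and column names are hypothetical.
import pandas as pd

df = pd.read_csv("customers.csv")

# Profile: understand structure and quality before cleansing
profile = {
    "row_count": len(df),
    "null_counts": df.isna().sum().to_dict(),
    "duplicate_rows": int(df.duplicated().sum()),
    "dtypes": df.dtypes.astype(str).to_dict(),
}
print(profile)

# Cleanse: drop exact duplicates, normalize text, coerce bad dates to NaT
df = df.drop_duplicates()
df["email"] = df["email"].str.strip().str.lower()
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
```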
3. Continuous Data Monitoring and Validation
Regularly checking and validating data ensures it remains accurate and reliable. Advanced tools like data observability platforms can automate this process, offering real-time insights and early detection of issues. Continuous monitoring helps prevent costly downstream effects.
4. Data Integration and ETL Best Practices
Standardizing data formats and validating data during the ETL (Extract, Transform, Load) process are crucial. Proper ETL practices can prevent data loss and corruption, leading to a 25% increase in data accuracy, as reported by TDWI.
5. Utilizing AI and Machine Learning for Data Quality Management
AI and ML technologies can automate the detection and correction of data anomalies, enhancing data quality management. AI-powered tools can identify patterns and trends, enabling proactive quality management. By 2025, AI-driven data quality solutions are expected to become a standard in the industry.
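As a hedged sketch of this idea, the snippet below uses scikit-learn's IsolationForest (an assumed dependency, not any specific product's method) to flag anomalous data batches from simple quality signals; the metrics and contamination setting are illustrative only.

```python
# Flag anomalous data batches with an unsupervised model (scikit-learn assumed).
# The batch metrics below are made up: row count, null ratio, mean amount.
import numpy as np
from sklearn.ensemble import IsolationForest

batches = np.array([
    [10_000, 0.01, 52.3],
    [10_200, 0.02, 51.9],
    [10_150, 0.01, 52.7],
    [2_300, 0.40, 11.0],   # a suspicious batch
])

model = IsolationForest(contamination=0.25, random_state=42).fit(batches)
flags = model.predict(batches)  # -1 marks an anomaly
print([i for i, f in enumerate(flags) if f == -1])
```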
6. Data Quality Metrics and KPIs
Measuring data quality through metrics such as accuracy, completeness, consistency, and timeliness is essential. Setting and monitoring these metrics helps evaluate the effectiveness of data quality initiatives, guided by industry benchmarks from DAMA International.
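To make these metrics concrete, here is a minimal pandas sketch that computes completeness, uniqueness, and a simple freshness check; the file, column names, and 24-hour threshold are assumptions for illustration.

```python
# Compute a few basic data quality KPIs with pandas (assumed available).
# File, columns, and threshold are hypothetical; timestamps assumed timezone-naive.
import pandas as pd

df = pd.read_csv("orders.csv", parse_dates=["order_ts"])

completeness = 1 - df["customer_id"].isna().mean()   # share of non-null keys
uniqueness = 1 - df["order_id"].duplicated().mean()  # share of distinct order IDs
fresh_within_24h = (pd.Timestamp.now() - df["order_ts"].max()) < pd.Timedelta("1D")

print(f"completeness={completeness:.2%}, uniqueness={uniqueness:.2%}, "
      f"fresh_within_24h={fresh_within_24h}")
```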
Ensuring high data quality is crucial for the success of AI projects. By implementing robust governance frameworks, profiling and cleansing data, continuously monitoring quality, following ETL best practices, leveraging AI technologies, and setting quality metrics, organizations can overcome data challenges and achieve superior AI outcomes.
Referred by Datagaps
#DataOpsSuite
Request a demo today
Demo: https://www.datagaps.com/request-a-demo/#utm_source=youtube&utm_medium=yt_video&utm_campaign=yt_request_demo&utm_id=yt_request_demo
salvatoretirabassi · 3 months
As I delve into the world of machine learning, I find myself drawn to practical techniques that yield real-world results. David Langer's emphasis on decision trees and random forests resonates with me, as these methods are both powerful and easy to learn. I encourage you to explore Langer's four-hour TDWI course to master the basics of Python and unlock the full potential of machine learning in your organization. Learn more and register for the course by June 26, 2024, and join the machine learning revolution. #MachineLearning #PythonForAnalytics https://tdwi.org/Articles/2024/06/05/ADV-ALL-Real-World-Techniques-for-Machine-Learning-David-Langer.aspx
industry212 · 4 months
What are the top 10 certifications to excel in business intelligence?
Achieving certifications in business intelligence (BI) can significantly enhance your skills and credibility in the field, opening up new career opportunities and validating your expertise to employers. Here are the top 10 certifications to excel in business intelligence:
1. Certified Business Intelligence Professional (CBIP)
Provider: Data Warehousing Institute (TDWI)
Key Features:
Comprehensive certification covering various BI disciplines, including data integration, analytics, and data management
Validates expertise in BI concepts, methodologies, and best practices
Consists of multiple levels (Foundation, Practitioner, Master) to accommodate different skill levels and career stages
2. Microsoft Certified: Azure Data Engineer Associate
Provider: Microsoft
Key Features:
Focuses on designing and implementing data solutions on the Microsoft Azure platform
Validates skills in data storage, data processing, and data visualization using Azure services like Azure SQL Database, Azure Data Factory, and Power BI
Demonstrates proficiency in building scalable and reliable BI solutions on cloud infrastructure
3. Tableau Desktop Specialist
Provider: Tableau
Key Features:
Entry-level certification focused on Tableau Desktop, a leading BI and data visualization tool
Validates proficiency in creating basic visualizations, organizing data, and sharing insights using Tableau Desktop
Ideal for beginners looking to establish foundational skills in data visualization and analysis
4. Qlik Sense Business Analyst Certification
Provider: Qlik
Key Features:
Designed for business analysts and data professionals using Qlik Sense, a popular BI and data analytics platform
Covers topics such as data modeling, visualization design, and data storytelling
Validates skills in leveraging Qlik Sense to analyze data, generate insights, and make data-driven decisions
5. IBM Certified Data Architect – Big Data
Provider: IBM
Key Features:
Focuses on designing and implementing big data solutions using IBM technologies such as IBM BigInsights and IBM Db2
Validates skills in data modeling, data architecture, and data integration in the context of big data environments
Demonstrates proficiency in leveraging IBM tools to solve complex data challenges
6. SAS Certified BI Content Developer
Provider: SAS
Key Features:
Designed for individuals developing BI content using SAS software, including SAS Visual Analytics and SAS Visual Statistics
Validates skills in data exploration, report building, and dashboard creation using SAS BI tools
Demonstrates proficiency in leveraging SAS software to deliver actionable insights and drive decision-making
7. Oracle Business Intelligence Foundation Suite 11g Certified Implementation Specialist
Provider: Oracle
Key Features:
Focuses on implementing and configuring Oracle Business Intelligence Enterprise Edition (OBIEE) solutions
Covers topics such as metadata modeling, report development, and dashboard design using OBIEE
Validates skills in deploying and managing BI solutions using Oracle technologies
8. MicroStrategy Certified Analyst
Provider: MicroStrategy
Key Features:
Designed for individuals analyzing data and building reports using MicroStrategy software
Validates skills in data exploration, dashboard creation, and advanced analytics using MicroStrategy tools
Demonstrates proficiency in leveraging MicroStrategy to derive insights and drive business outcomes
9. SAP Certified Application Associate – Business Intelligence with SAP BW 7.5 & SAP BI 4.3
Provider: SAP
Key Features:
Focuses on business intelligence concepts and tools within the SAP ecosystem, including SAP BW (Business Warehouse) and SAP BusinessObjects
Validates skills in data modeling, report creation, and data visualization using SAP BI tools
Demonstrates proficiency in leveraging SAP technologies to deliver actionable insights and support decision-making
10. Google Data Analytics Professional Certificate
Provider: Google (Coursera)
Key Features:
Entry-level certification covering foundational concepts in data analytics and BI
Consists of a series of courses covering topics such as data analysis, data visualization, and data cleaning using tools like SQL, Python, and Google Analytics
Ideal for beginners looking to kickstart their career in data analytics and business intelligence
Considerations for Choosing Certifications
Relevance to Career Goals: Choose certifications aligned with your career aspirations and areas of interest within the field of business intelligence.
Industry Recognition: Prioritize certifications that are widely recognized and valued by employers in the industry.
Skill Level: Consider your current skill level and experience when selecting certifications, opting for entry-level certifications for beginners and more advanced certifications for seasoned professionals.
Learning Format: Evaluate the learning format (e.g., self-paced online courses, instructor-led training) and duration of the certification programs to ensure they fit your learning preferences and schedule.
By earning certifications in business intelligence, you can demonstrate your expertise, stay competitive in the job market, and unlock new opportunities for career advancement in the dynamic field of data analytics and BI.
valutric · 2 years
Organizations Struggle with Time to Value for BI
New research from TDWI highlights the obstacles analytics projects face and suggests best practices for overcoming those challenges. Many enterprises have invested heavily in business intelligence (BI) and analytics solutions, but they aren't always seeing the benefits they had hoped to achieve with those solutions. More specifically, they say that it takes too long to achieve value when starting a new analytics project. That is one of the key findings of a new study titled "Accelerating the Path to Value with Business Intelligence and Analytics," published by TDWI.
When asked to prioritize their goals for their BI and analytics efforts, survey respondents ranked "reduce project time to value" number one. That wasn't a surprise for David Stodder, senior director of research for business intelligence at TDWI and the author of the report. "I hear all the time that organizations want to get to value faster," he said. However, when asked about their satisfaction with the current time to value of their projects, only 10% said they were "very satisfied," and nearly half (45%) were either "somewhat unsatisfied" or "not very satisfied."
The reason for that dissatisfaction became clearer when respondents were asked to compare their current time to value with past performance. Only 8% of those surveyed said projects are delivering value much faster than they were a year ago. Read the full article
trylkstopocket · 2 years
Cloud data lakes
Introduction
According to a December 2021 report from TDWI on "Data Engineering and Open Data Lakes," the software industry is witnessing a massive shift from cloud data warehouses to cloud data lakes because of the data lake's superior flexibility. They fulfill a promise that has been long in the making: the need for a vastly scalable solution that can easily ingest, integrate, analyze, share, and secure any amount of data, in just about any format, without requiring the data to be modeled or stored in a predefined structure. This flexibility allows data professionals to "load data first and ask questions later," broadening the horizons of business intelligence, predictive analytics, application development, and other data-driven initiatives.
However, despite continued enthusiasm for the data lake paradigm, poorly constructed data lakes can easily turn into data swamps — unorganized pools of data that are difficult to use, understand, and share with business users. This has been happening for more than a decade. To mitigate this risk, the most advanced cloud data lakes are created on top of cloud data platforms — scalable solutions that combine everything great about data warehouses, data lakes, and other key workloads into one cohesively managed solution.
Cloud Data Lakes For Dummies, Summarized
The essential aspects of data management: Prioritizing data governance, data security, and data privacy.
Chapter 1: Introducing Cloud Data Lakes
IN THIS CHAPTER
Flowing data into lakes
Acknowledging the limitations of traditional data lakes
Discussing the pros and cons of cloud object storage
Introducing modern cloud data lakes
Looking at who uses modern data lakes and why
This chapter digs into the history of the data lake. It explains why this type of data repository emerged, what data lakes can do, and why traditional data lakes have fallen short of the ever-expanding expectations of today’s data professionals.
Flowing Data into the Lake
What’s behind the name data lake? Picture data streaming in from many different sources, all merging into one expansive pool.
Now, compare that vision to the function-specific "ponds" that characterize special-purpose data management systems, such as data warehouses and data marts designed explicitly for finance, human resources, and other lines of business. These siloed analytic systems typically load structured data into a predefined schema, such as a relational database, making it easily accessible via Structured Query Language (SQL) — the standard language used to communicate with a database. By contrast, the hope for data lakes was to store many types of data in their native formats to facilitate ad hoc data exploration and analysis. In addition to the orderly columns and rows of relational database tables, these data lakes would store semi-structured and unstructured data, and make that data available to the business community for reporting, analytics, data science, and other pressing needs.
Understanding the Problems with Traditional Data Lakes
Data lakes arose to supplement data warehouses because the relational model can't easily accommodate today's diversity of data types and their fast-paced acquisition models. While data warehouses are generally designed and modeled for a particular purpose, such as financial reporting, data lakes don't always have a predetermined use case. Their utility becomes clear later on, such as when data scientists conduct data exploration for feature engineering and developing predictive models.
While this data discovery process opens up near-limitless potential, few people anticipated the management complexity, lackluster performance, limited scaling, and weak governance that characterized these open-ended data lake implementations.
Part of the problem was the inherent complexity of these early data lakes. The core technology was based on the Apache Hadoop ecosystem, an open source software framework that distributes data storage and processing among commodity hardware located in on-premises data centers.
Many of these data lake projects failed to fulfill the promise of data lake computing due to expensive infrastructure, slow time to value, and extensive system management requirements.
The inherent complexities of a distributed architecture and the need for custom coding for data transformation and integration, mainly handled by highly skilled data engineers, made it difficult to derive valuable insights and outcomes. It was easy to load and store huge amounts of data in many different formats but difficult to obtain valuable insights from that data.
Acknowledging Interim Solutions: Cloud Object Stores
In the years since data lakes were first introduced, cloud computing has evolved, and data storage technologies have matured considerably. Many organizations now leverage object storage services, such as Amazon Simple Storage Service (S3), Microsoft Azure Blob Storage, and Google Cloud Storage, in attempts to create their own data lakes from scratch.
Not having to create or manage compute clusters and storage infrastructure, as was necessary with Hadoop, is a big step forward. However, cloud object stores don't offer a total data lake solution either. For example, although customers no longer have to provision and scale a distributed hardware stack, they still have to create, integrate, and manage complex software environments. This involves setting up procedures to access and sometimes transform data, and establishing and enforcing policies for data security, data governance, identity management, and other essential activities. Finally, customers have to figure out how to achieve adequate performance for a variety of codependent analytic workloads, such as business intelligence, data engineering, and data science, all of which may compete for the same pool of compute and storage resources.
Other common problems include difficulty managing and scaling the environment, and inadequate procedures for managing data quality, security, and governance. Without attention to these complex issues, even well-constructed data lakes can quickly become data swamps. The greater the quantity and variety of data, the more significant this problem becomes, and the harder it is to derive meaningful insights.
Cloud object stores allow organizations to store and analyze unlimited amounts of data in their native formats. However, that leaves organizations to take charge of data management, data transformation, data protection, data governance, data compliance, and many other complex activities.
Reviewing Modern Requirements
Despite these early failings, the original promise of the data lake remains: a straightforward and powerful way for organizations to collect, store, integrate, analyze, and share their data from a single repository. They want to explore, refine, and analyze petabytes of data without a predetermined notion of the data’s structure.
Most of today’s data lakes, however, can’t effectively organize all of that data, let alone properly secure and govern that data.
To be truly useful, a data lake must include a cohesive set of tools that reveal what is in the data lake, who is using which data, and how that data is being used, along with assurances that all data is protected. It must also store data in their native formats and facilitate user-friendly data exploration, automate routine data management activities, and support a broad range of use cases and workloads, such as modern data sharing. What’s more, the data lakes of today must be fed by a number of data streams, each of which delivers data at a different frequency, without imposing onerous requirements on the data engineering teams that build these data pipelines. And it must handle all of this without any storage or performance limitations.
To meet these needs, a third and far better data lake paradigm has arisen. These solutions have become the foundation for the modern data lake: A cloud-built repository where structured, semi-structured, and unstructured data can be staged in their raw forms — either in the data lake itself or in an external object storage service.
Anchored by a cloud data platform, these newer data lakes provide a harmonious environment that blends many different data management and data storage options, including a cloud analytics layer, a data warehouse, and a cloud-based object store. With the right software architecture, these data lakes provide near-unlimited capacity and scalability for the storage and computing power you need. They make it easy to derive insights, obtain value from your data, and reveal new business opportunities.
Explaining Why You Need a Modern Cloud Data Lake
This book reveals how to create innovative, cost-effective, and versatile data lakes — and extend existing data lakes created using Hadoop, cloud object stores, and other limiting technologies. It relies on a modern architecture that is secure, resilient, easy to manage, and supports many types of users and workloads.
In addition to anchoring a versatile data lake, standardizing on a well-architected cloud data platform has many other advantages. For example, it makes it easy to share data with authorized users without requiring database administrators to copy that data or establish a new data silo, all while upholding centralized data security and governance policies. It makes it easier to accommodate new design patterns, such as a data mesh, and integrate new data formats, such as Apache Iceberg tables. And yet, even with this diversity, the entire environment can be operated with familiar SQL tools, while data professionals can also use their chosen languages, tools, open source libraries, and development frameworks. A consumption-based pricing model should accompany the data lake to ensure each user and team only pays for the precise compute and storage resources they use. Best of all, a modern cloud data platform should operate seamlessly across multiple public clouds via one consistent management interface, so your DevOps team can ensure maximum continuity, and your organization will never be limited to one single cloud provider.
As on-premises data lakes decline in popularity and cloud object stores show their limitations, new architectural paradigms based on cloud data platforms are revealing their potential. Because all storage objects and necessary compute resources are internal to the platform, data can be accessed, analyzed, modeled, and manipulated quickly and efficiently. This is much different from the original data lake architectures, where data was always stored in an external data bucket and then copied to another loosely integrated storage-compute layer to achieve adequate analytics performance.
Looking at Which Industries Use Modern Data Lakes and Why
Modern cloud data lakes can play an important role in every industry. For example, ecommerce retailers use modern data lakes to collect clickstream data for monitoring web-shopping activities. They analyze browser data in conjunction with customer buying histories to predict outcomes. Armed with these insights, retailers can provide timely, relevant, and consistent messaging and offers for acquiring, serving, and retaining customers.
Oil and gas companies use data lakes to improve geological exploration and make their extraction operations more efficient and productive. Data from hundreds or thousands of sensors helps oil and gas companies discover trends, predict equipment failures, streamline maintenance cycles, and understand their operations at very detailed levels.
Banks and financial services companies use data lakes to analyze market risks and determine which products and services to offer. In much the same way, nearly all customer-focused organizations can use data lakes to collect and analyze data from social media sites, customer relationship management (CRM) systems, and other sources, both internal to the company and via third-party data services. They can use all that data to gauge customer sentiment, adjust go-to-market strategies, mitigate customer support problems, and create highly personalized experiences for customers and prospects.
Traditional data lakes fail because of their inherent complexity, poor performance, and lack of governance, among other issues. By leveraging the capabilities of a cloud data platform, modern data lakes overcome these challenges.
Foundational tenets of these versatile, high-performance data lakes include:
No data silos: Easily store and access petabytes of structured, semi-structured, and unstructured data from a single platform, even across multiple clouds, in a cohesive way.
Fast and flexible: Allow developers and other experts to work with data in their preferred languages. For example, data engineers can process data with Java, data scientists can run models in Python, and analysts can query with SQL.
Instant elasticity: Supply nearly any amount of computing resources to any user or workload. Dynamically change the size of a compute cluster without affecting running queries, or scale the service to easily include additional compute clusters to complete intense workloads faster.
Concurrent operation: Deploy to a near-unlimited number of users and workloads to access a single copy of your data, all without affecting performance. For example, you may merely want to run analytical queries, then later allow developers to build data-intensive applications.
Inherent control: Present fresh and accurate data to users, focusing on data sharing and collaboration, data quality, access control, and metadata management.
Reliable: Confidently combine data to enable multi-statement, ACID transactions.
Fully managed: The data platform automates many aspects of data provisioning, data protection, security, backups, and performance tuning, allowing you to focus on analytic endeavors rather than on managing hardware and software.
INSTANT ELASTICITY FOR HEALTHCARE ANALYTICS
Scripps Health is a nonprofit healthcare system based in San Diego, California, that includes 5 acute-care hospital campuses and 28 outpatient centers and clinics. It has more than 16,000 employees and treats 600,000 patients annually via 3,000 affiliated doctors.
Previously, Scripps relied on an on-premises Hadoop cluster and a legacy data warehouse platform for healthcare analytics. The system supported several data warehouse use cases but required a specialized IT team to develop, administer, scale, and tune it for adequate performance. The team phased out the Hadoop cluster and subscribed to a modern cloud data platform and cloud blob storage. Today, Scripps stores high-priority, or "hot," data in the cloud data platform, and stores archival "cold" data in cloud blob storage.
By adopting this low-maintenance environment, Scripps achieved a 50 percent reduction in full-time equivalent (FTE) staff dedicated to database administration, and reduced its software licensing costs by 60 percent. Previously, users retrieved data and analyzed it on their own systems, creating data silos. Now, users access data through the intuitive interface of the cloud data platform, eliminating those silos. The cloud data platform distinctly separates but logically integrates storage and compute resources into independently scalable entities, enabling Scripps Health to scale capacity up and down as needed. Each business unit pays only for the compute resources it consumes, and a data-masking feature masks plain-text data at query time for stronger security, an important factor when dealing with patient data and personally identifiable information (PII).
Now that Scripps Health has a robust modern cloud data lake, with repeatable processes for a variety of data-intensive workloads, it is experimenting with developing predictive analytics, building statistical models, and retrieving data using a standard ODBC connection. As a fully managed cloud solution, near-zero maintenance frees data professionals at Scripps Health to focus on revealing fresh insights to advance strategic business initiatives.
Chapter 2: Enabling Modern Data Science and Analytics
IN THIS CHAPTER
Boosting team productivity
Supporting popular data science tools
Building on the right architecture
Accommodating many different workloads
For many data science initiatives, data lakes are the repositories of choice. However, managing data in today's data lakes is fraught with difficulty. According to an Anaconda report titled "The State of Data Science 2020: Moving from Hype Toward Maturity," data scientists spend an average of 45 percent of their time preparing data in the data lake before they can use it to develop machine learning (ML) models and visualize the outcomes in meaningful ways.
This chapter describes the three fundamental attributes of a data lake that help ensure successful data science and other types of analytic endeavors:
The capability to seamlessly combine and easily access multiple types of data, all stored in one universal repository
The freedom for data scientists to collaborate using their chosen tools, frameworks, libraries, and languages
An architecture that allows data scientists, business analysts, and other data professionals to collaborate productively over data without having to contend for compute and storage resources
Establishing a Data Foundation
Data lakes were born out of the necessity of big data analytics. These multipurpose repositories provide the technology organizations need to store data until data scientists discover potential uses and applications. However, traditional data lakes can be difficult to secure, govern, and scale. They may also lack the crucial metadata data scientists need to make sense of the information. Metadata is data about data.
A cloud data platform resolves these issues by providing a natural structure for many types of data. In addition to capturing raw data, as is common for a data lake, it stores and manages the metadata that allows data scientists to conduct meaningful analyses, such as tagging fields in a document and categorizing patterns within images. Having a common metadata layer also helps various data users collaborate with the data by ensuring accurate, consistent results when the data is displayed through dashboards and reports.
The services layer is the linchpin of a modern cloud data platform. It manages metadata, transactions, and other operations. It performs these activities locally or globally across multiple regions and clouds, enforcing centralized security and governance as it tracks, logs, and directs access to every database element and object within the data lake.
Boosting Team Productivity
A properly architected data lake supports multiple business units and workloads, with one centralized repository rather than multiple data silos serving discrete needs. The data platform enables a single dynamic copy of the data that can populate and update ML models, business intelligence (BI) dashboards, and predictive analytic apps. It also orchestrates analytics, data sharing, data ingestion, and data science.
This architecture allows data professionals to easily process data relevant to their sphere of operations. Whether creating data pipelines, conducting feature engineering, developing data applications, issuing queries, or setting up data-sharing relationships, all teams can collaborate on a unified, shared repository of data. This synergy is especially valuable for data science teams. Consolidating data into one central location streamlines the data science workflow by facilitating collaboration among all workflow participants, including data scientists, data engineers, and ML engineers.
Having a complete services layer is what makes a data lake useful. It rationalizes differences among various data types, so people don't have to look for data in multiple places. It applies centralized security and governance, even when the data set spans multiple clouds and multiple regions. This eliminates the inconsistent results that arise when various work groups use different copies of the data.
Supporting Languages and Tools
Today's data science teams use a broad range of software tools, algorithms, open source libraries, and ML principles to uncover business insights hidden in vast volumes of data. Whether writing queries, building data pipelines, or embedding custom logic in a software program or procedure, it should be simple for data professionals to interact with the platform directly, without having to move data from one database to another. These highly paid workers are most productive when they can collaborate on a single shared version of data, upholding universal security constraints, even when they use multiple tools.
To allow all types of data professionals to work productively, your data lake must support popular ML frameworks and languages. Data engineers commonly use SQL, Python, and Java to prepare data. Data scientists use Python, SQL, and R to explore data relationships, conduct feature engineering, and train ML models. Ideally, your data lake should enable the data frame style of programming preferred by many technology experts, which aligns data into a two-dimensional array, much like the structured rows and columns of a relational database or spreadsheet.
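To illustrate the data frame style described above, here is a small pandas sketch (pandas is assumed; the table and column names are invented) showing the familiar filter, group, and aggregate pattern over rows and columns.

```python
# Data frame style of programming: tabular data manipulated in two dimensions.
import pandas as pd

events = pd.DataFrame({
    "user_id": [1, 2, 1, 3],
    "channel": ["web", "mobile", "web", "web"],
    "spend": [12.5, 3.0, 7.25, 40.0],
})

# Operations familiar from SQL: filter rows, group, and aggregate
per_channel = (
    events[events["spend"] > 5]
    .groupby("channel", as_index=False)["spend"]
    .sum()
)
print(per_channel)
```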
When your data platform is architected to support multiple teams and workloads without forcing each team to contend for resources, the entire data science practice becomes timelier and more productive. Data scientists output the results of ML activities back into the data platform for general-purpose analytics, even as data engineers load data and business analysts consume it. A common repository allows BI apps to leverage the results of data science initiatives and put the data to work throughout the business. It also ensures reliable outcomes: All front-end apps reference the same back-end data definitions, ensuring consistent results for queries, forecasts, dashboards, and reports.
Accommodating Multiple Workloads and Communities
With a traditional data platform, fixed compute and storage resources limit concurrency — the capability for many users to deploy many data workloads simultaneously. A cloud data platform built on a multi-cluster, shared data architecture scales compute and storage resources independently of each other and near-infinitely. This allows multiple users to query the same data without degrading performance, even as other workloads operate simultaneously, such as ingesting data or training an ML model.
A well-architected data lake also allows data users to combine data generated by an organization with third-party data sets, such as those acquired from its business partners or purchased from data marketplaces. By doing so, the organization gains previously unobtainable insights about its business and its customers. The enriched data and the insights it generates also create new market opportunities in the form of monetizing data and data applications that extend data science learnings to internal and external communities. An organization can offer these commercial data sets and data applications to customers and partners and also list them on a data marketplace. A modern cloud data platform should offer the capability to connect to a cloud data marketplace, where an ecosystem of third-party data, technology, and data service providers bring additional data, tools, and services into the ecosystem, broadening what's possible for data science teams.
A cloud data platform makes a data lake more useful. It fosters collaboration and ensures the entire organization has a scalable data environment for data science and related analytic endeavors. For example, data scientists and ML engineers can access raw data straight from the data lake for feature engineering and modeling activities while business analysts generate reports via self-service dashboards — all without degrading performance.
START WITH THE BEST ARCHITECTURE
A multi-cluster, shared data architecture includes three layers that are logically integrated yet scale independently from one another:
Storage: A single place for all structured, semi-structured, and unstructured data
Compute: Independent computing resources dedicated to each workload to eradicate contention for resources
Services: A common services layer that handles infrastructure, security, metadata, query optimization, and much more
OPTIMIZED PRICING, STORE LOCATION, AND SUPPLY CHAIN VIA DATA SCIENCE
Żabka owns the largest chain of convenience stores in Poland, with more than 7,000 stores visited daily by more than 2.5 million customers. With millions of daily transactions, the amount of data soon overwhelmed Żabka's on-premises data warehouse and data lake Hadoop cluster, making it impossible for data scientists to load and analyze data concurrently.
With its previous data lake environment, Żabka could analyze transaction data to optimize prices for each store, but data engineers and business analysts had to share a finite set of compute and storage resources. Meanwhile, Żabka's data scientists wanted to create new ML models to enable a more advanced product pricing strategy but could only work at certain times of the day.
Żabka switched to a modern cloud data platform to establish a modern data environment that anchors both the data warehouse and a new data lake, transforming its ability to make data-driven decisions. The new data platform allows each of these teams to instantly add additional computing power during high-traffic hours so they can load data, run queries, refine models, and generate reports as needed. For example, data scientists pulled data from the data lake to identify 14 consistent store segments, using internal data on transactions, marketing promotions, and assortments. They combined this data with dozens of external data sets containing the prices and locations of competitors, upcoming events, geographic coordinates, and demographic information.
These advanced data science models have allowed Żabka to optimize pricing for each product in each store, which has increased revenue and margins. In addition, a new revenue-estimation model allows team members to determine the most effective locations for new stores. Żabka can also share this near real-time data with suppliers to increase sales, personalize consumer communication, and perform market research. Insights gathered from its Poland stores will be valuable for expanding into other countries.
Chapter 3: Reducing Risk, Protecting Diverse Data
IN THIS CHAPTER
Planning your data lake implementation
Complying with privacy regulations
Establishing comprehensive data security
Improving data retention, protection, and availability
Your organization's data is incredibly valuable, and this book is all about maximizing that value with the latest technologies for storing, analyzing, and gaining useful insights from that data. However, your data is also valuable to bad actors who are continually unleashing malware, phishing schemes, and other nefarious plots designed to steal or compromise your data assets. In the process, they may force your organization to pay a ransom to call off the attack. According to a recent report from Cybersecurity Ventures, ransomware costs are expected to reach $265 billion by 2031, while global cybercrime costs will grow 15 percent per year over the next five years, reaching $10.5 trillion annually by 2025.
This growing risk of malicious attacks is compounded by internal threats, mishaps, and compliance violations, often stemming from simple errors, omissions, or failure to apply software patches in a timely manner. This chapter discusses the need to plan carefully and deliberately as you set up your data lake to deliver the best data security, privacy, and regulatory compliance.
Facing Facts about Data Security
If you entrust your data to a cloud provider or software-as-a-service (SaaS) vendor, will they keep it secure? In the early days of cloud computing, this was a hotly debated topic. Today, the superiority of cloud security is one of the motivating factors that encourages organizations to put their data in the cloud. Cloud providers such as Amazon, Microsoft, and Google have established sophisticated security operation centers (SOCs) staffed by elite teams of IT professionals trained in the most current cybersecurity practices. Reputable SaaS providers have followed suit. As a result, a well-architected and properly maintained cloud data lake can be more secure than the data warehouses and data lakes that you host in your own data center.
All aspects of a data lake — its architecture, implementation, and operation — must center on protecting your data. Your data security strategy should include data encryption and access control, in conjunction with comprehensive monitoring, alerts, and cybersecurity practices. You must also monitor and comply with data privacy regulations that govern the use and dissemination of customer data.
However, ensure you understand precisely what your data platform vendor provides. Security capabilities vary widely among vendors. And although they might have good security, they differ in their degree of automation and assistance. Some cloud vendors automate only rudimentary security capabilities, leaving many aspects of data encryption, access control, and security monitoring to the customer. Others handle these tasks for you.
Effective security can be complex and costly to implement. Cybersecurity professionals are hard to come by. Instead of building an in-house security operations center from scratch, if you subscribe to a modern cloud data platform with automated security capabilities, you can achieve a high level of data protection as soon as you enable the data platform.
Encrypting Data Everywhere
Encrypting data, which means applying an encryption algorithm to translate clear text into ciphertext, is a fundamental security feature. Data should be encrypted both "at rest" and "in transit": when the data is stored on disk, when it is moved into a staging location for loading into the data lake, when it is placed within a database object in the data lake itself, and when it is cached within a virtual data lake. Query results must also be encrypted.
End-to-end encryption should be the default, with security methods that keep the customer in control, such as customer-managed keys. This type of "always on" security is not a given with most data lakes, as many highly publicized on-premises and cloud security breaches have revealed.
Managing Encryption Keys
After you encrypt your data, you’ll decrypt it with an encryption key (a random string of bits generated specifically to scramble and unscramble data). To fully protect the data, you must protect the key that decodes your data. A robust data lake should handle data encryption and key management automatically, all the time, for all data, when it is in transit and at rest.
The best data lakes employ AES 256-bit encryption with a hierarchical key model rooted in a dedicated hardware security module to add layers of security, protection, and encryption. They also institute key-rotation processes that limit the time during which any single key can be used. Data encryption and key management should be entirely transparent to the user but not interfere with performance.
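For illustration only, the snippet below shows AES-256-GCM encryption and decryption with the Python cryptography package (an assumed dependency); a managed data platform would generate, store, and rotate keys for you, typically in a key management service or HSM.

```python
# A hedged sketch of AES-256-GCM encryption; not any platform's actual mechanism.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=256)  # in practice, held in a KMS/HSM and rotated
nonce = os.urandom(12)                     # must be unique per message

aesgcm = AESGCM(key)
ciphertext = aesgcm.encrypt(nonce, b"account_number=4111-1111", None)
plaintext = aesgcm.decrypt(nonce, ciphertext, None)
assert plaintext == b"account_number=4111-1111"
```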
Automating Updates and Logging
Cybersecurity is never static. The security measures you apply to your data lake must evolve to reflect today's dynamic threat landscape. That means always keeping up with security patches that address known threats.
Ideally, these security updates should be applied automatically to all pertinent components of the cloud data platform as soon as those updates are available. If you use a cloud provider, that vendor should also perform periodic security testing (also known as penetration testing) to proactively check for security flaws. These safeguards should not impact your daily use of the cloud data platform.
As added protection, verify your data lake vendor uses file integrity monitoring (FIM) tools, which ensure critical system files aren't tampered with. All security events should be automatically logged in a tamper-resistant security information and event management (SIEM) system. The vendor must administer these measures consistently and automatically, and they must not affect query performance.
Controlling Access to Sensitive Data
All users must be authorized before accessing or manipulating data in the data lake. For authentication, ensure your connections to the data platform provider leverage standard security technologies, such as Transport Layer Security (TLS) 1.2 and IP whitelisting. (A whitelist is a list of approved addresses or domains from which connections are allowed.) A cloud data lake should also support the SAML 2.0 standard so you can leverage your existing password security requirements and existing user roles. Regardless, multifactor authentication (MFA) should be required to prevent users from logging in with stolen credentials. With MFA, users are challenged with a secondary verification request, such as a one-time security code sent to a mobile phone.
After a user has been authenticated, it’s important to enforce authorization to specific parts of the data based on that user’s “need to know.” A modern data lake must support multilevel, role-based access control (RBAC) functionality so users requesting access to the data lake are authorized to access only the data they are explicitly permitted to see.
In addition to this basic authentication, fine-grained access control allows database administrators to apply security constraints and rules to certain parts of each object, such as at the row level and column level within a database table. Access constraints can also be applied to compute servers to control which users can execute large data processing jobs. Another useful feature is geofencing, which allows the administrator to set up and enforce access restrictions based on the users’ location.
As you add semi-structured and unstructured data to your data lake, other important stipulations apply. Granular access control becomes more difficult with the file-based storage often found in a data lake (see the “Common Ways to Store Data” sidebar), which doesn’t conform to a tabular structure. With many of today’s object stores, security may be “all or nothing”: You either have access to the storage layer or don’t. To bolster this basic security, your data lake provider should apply fine-grained RBAC measures to all database objects, including tables, schemas, and any virtual extensions to the data lake.
In some instances, you can also use secure views to prevent access to highly sensitive information most users don’t need to see. This security technique allows you to selectively display some or all the fields in a table, such as only allowing HR professionals to see the salary fields in an employee table.
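The sketch below is a simplified, application-side illustration of the secure-view idea: a role determines which columns a user may see. The roles, columns, and data are hypothetical, and a real platform enforces this inside the database rather than in client code.

```python
# Column-level, role-based filtering as a toy illustration of a secure view.
import pandas as pd

employees = pd.DataFrame({
    "name": ["Ada", "Grace"],
    "dept": ["Eng", "Eng"],
    "salary": [185_000, 192_000],
})

VISIBLE_COLUMNS = {
    "hr_analyst": ["name", "dept", "salary"],
    "eng_manager": ["name", "dept"],  # salary withheld from this role
}

def secure_view(df, role):
    """Return only the columns the given role is permitted to see."""
    return df[VISIBLE_COLUMNS[role]]

print(secure_view(employees, "eng_manager"))
```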
Complying with Data Privacy Regulations
For sensitive data, such as tables that populate financial reports or columns that contain personally identifiable information (PII), knowing where data resides within your data lake is critical to satisfying regulatory compliance requirements. Privacy regulations are increasingly rigorous, and organizations can't ignore them. Leading the way are Europe's General Data Protection Regulation (GDPR), the United States' Health Insurance Portability and Accountability Act (HIPAA), and the California Consumer Privacy Act (CCPA). Corporate data governance policies should verify data quality and standardization to ensure your data is properly prepared to meet these requirements. The types of information that fall under these specific guidelines include credit card information, Social Security numbers, names, dates of birth, and other personal data.
COMMON WAYS TO STORE DATA
Data lakes use files, blocks, and objects to store and organize data.
File storage organizes data as a hierarchy of files in folders. It is popular for unstructured data such as documents and images, especially when used in low-latency applications such as high-performance computing (HPC) and media processing.
Block storage divides data into evenly sized volumes, each with a unique identifier. It is commonly used for databases that require consistent performance and low-latency connectivity.
Object storage breaks files into pieces that can be spread out among hardware platforms, each object acting as a self-contained repository. It is useful for unstructured data, such as music, video, and image files.
Certifying Attestations
Data breaches can cost millions of dollars to remedy and permanently damage customer relationships. Industry-standard attestation reports verify that cloud vendors use appropriate security controls and features. For example, your cloud vendors need to demonstrate they adequately monitor and respond to threats and security incidents and have sufficient incident response procedures in place.
In addition to industry-standard technology certifications, such as ISO/IEC 27001 and SOC 1/SOC 2 Type II, verify that your cloud provider also complies with all applicable government and industry regulations. Depending on your business, this could include Payment Card Industry Data Security Standards (PCI-DSS), GxP data integrity requirements, HIPAA/Health Information Trust Alliance (HITRUST) privacy controls, ISO/IEC 27001 security management provisions, International Traffic in Arms Regulations (ITAR), and FedRAMP certifications. Ask your providers to supply complete attestation reports for each pertinent standard, not just the cover letters.
An important stipulation within these data privacy regulations is the right to be forgotten, which means consumers can opt out of communications from merchants or vendors. In these instances, all links to and copies of their PII must be erased from a vendor's information systems. When all your data is stored in one universal repository that automatically manages metadata and lineage, fulfilling these requests is much easier. With minimal manual intervention, the platform should automatically detect PII and apply the appropriate policies to that information, even as data is loaded, staged, and moved across multiple tables and objects.
Isolating Your Data
If your data lake runs in a multitenant cloud environment, you may want it isolated from all other data lakes. If this added protection is important to you, ensure your cloud data platform vendor offers this premium service. Isolation should extend to the virtual machine layer. The vendor should isolate each customer's data storage environment from every other customer's storage environment, with independent directories encrypted using customer-specific keys.
If your company must adhere to certain data sovereignty requirements, then investigate the regional penetration of your cloud provider's coverage. For example, will the provider enable you to maintain sensitive data in specific cloud regions? Can you store encrypted data in the cloud and the encryption keys on premises? These capabilities are especially important in Europe and other highly regulated regions.
Work only with cloud providers that can demonstrate they uphold industry-sanctioned, end-to-end security practices. Security mechanisms should be built into the foundation of the data platform. You shouldn't have to do anything extra to secure your data.
Finally, data security and compliance hinge on traceability. You must know where your data comes from, where it is stored, who has access to it, and how it is used, which Chapter 4 discusses.
REDUCING THE RISK OF SHARING DATA
Portland General Electric (PGE) is a fully integrated energy company with statewide operations in Oregon, serving 1.9 million people in 51 cities.
Previously, PGE managed a legacy, on-premises data warehouse that was expensive to maintain and had performance issues. The system’s tightly coupled architecture was inflexible. In addition, multiple copies of the data proliferated across the organization, making it difficult to identify the authoritative source of data-driven insights.
This environment also increased PGE's data storage costs. Realizing the need for a modern data environment, PGE selected a cloud data platform as a foundation for a modern data lake that features high performance, separation of storage from compute, near-zero maintenance, and an extensive security architecture. Today, the cloud data platform increases security and governance capabilities for PGE's data lake, including data files stored in cloud object storage. Secure views on external tables keep data in place while providing row-level and column-level access to the data based on user IDs. Users are authenticated via single sign-on into a Tableau business intelligence environment. Data requested from Tableau dashboards is access-controlled by user-level privileges managed within secure views.
The cloud data platform also supports secure data sharing. Previously, providing data to external partners was a complex process. Now, PGE can seamlessly provide data to external groups and internal groups, such as its data science team, without copying or moving data. Instead of making and sending static copies of the production data set, PGE provides data scientists with read-only access to data that remains in its original location and is updated in near real time with the click of a button. PGE uses secure data sharing and secure views within its cloud data platform to maximize the accessibility of data while minimizing risks.
Chapter 4: Preventing a Data Swamp
IN THIS CHAPTER
Instituting robust data governance
Cataloging and classifying your data
Automatically detecting data schema
Improving data quality
Tracing data lineage
How do you prevent your data lake from becoming a data swamp — a quagmire of unmanageable data? You must start with a data platform that automatically collects metadata and enforces systematic data governance.
Data governance ensures data is properly classified and accessed. Metadata helps you understand exactly what data you have and how people use it.
Your data platform should track who uploads data, when, and what type of data it is. It should also identify key fields and values — a capability that is especially important when dealing with personally identifiable information (PII).
This chapter explains how a modern data lake achieves these objectives with the following essential ingredients: a data catalog, automatic schema detection, capabilities to track data lineage, and provisions for classifying and adding business context to your data.
Understanding Metadata
A data lake achieves effective governance by keeping track of where data is coming from, who touches the data, and how various data sets relate to one another. A robust cloud data platform automatically generates this type of metadata for files in internal stages (that is, within the data lake) and external stages (such as Amazon S3, Google Cloud Storage, or Microsoft Azure Blob Storage). This metadata is often maintained in virtual columns, queried using standard commands such as SQL SELECT statements, and loaded into a table with the regular data columns.
Cataloging Data
Much of the data loaded into first-generation data lakes isn't usable because it hasn't been cataloged. Picture a storage unit where you stash numerous items — family heirlooms and treasures as well as old furniture, mismatched clothing, and castoffs of questionable value. As the years pass, you may not have a clear idea of what is even stored there.
And so it is with a traditional data lake. If no designations and identifiers are on the data, users have difficulty finding and retrieving relevant information — and don’t even know what is there in the first place.
A data catalog helps users sort this out by empowering them to discover and understand the data. Most data catalogs include a self-service portal that makes it easy to view and understand metadata, which improves accuracy and enables more confident decision-making.
Whereas many organizations opt to use an external data catalog, modern data lakes are gradually evolving to include internal catalogs. Some solutions include directory tables that act as built-in file catalogs.
If you don’t catalog your data, you can quickly end up with a data swamp. A data catalog keeps track of what types of information you have, who can access it, and how popular it is, along with the lineage of the data and how it is used.
Detecting Data Schema
A schema defines how data is organized, structured, and related to other data. Schema objects can include table names, fields, data types, and the relationships among these entities. Whereas a data catalog keeps track of what data you have, a data schema helps you make sense of it.
A data lake that offers automatic schema detection can be helpful, especially to prepare semi-structured data for querying and analytics. For example, suppose a data engineer wants to create a table from Parquet files. In that case, the data lake can automatically register all the fields and data types, either by copying the file data into relational tables (schema on write) or by querying the file data in place (schema on read). To make it easier to query and join multiple data sets, look for a data platform that can detect schema in popular file types, such as Parquet, Avro, and ORC formats.
A data lake can accommodate many types of data because it’s not constrained by a predefined schema. However, data analysts need to know the schema of all the data sets and which tables and columns represent common entities. Schema detection automates these operations.
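As a concrete illustration, the snippet below uses the pyarrow library (an assumed choice; the text does not prescribe a specific tool) to read the schema embedded in a Parquet file, which is essentially what automatic schema detection does on your behalf, minus the registration step.

```python
# A minimal schema-on-read sketch with pyarrow; the sample data is invented
# so the example is self-contained.
import pyarrow as pa
import pyarrow.parquet as pq

# Write a small sample file first.
table = pa.table({
    "order_id": [1, 2, 3],
    "amount": [19.99, 5.00, 42.50],
    "placed_at": ["2023-01-01", "2023-01-02", "2023-01-03"],
})
pq.write_table(table, "orders.parquet")

# "Detect" the schema directly from the file's embedded metadata.
schema = pq.read_schema("orders.parquet")
for field in schema:
    print(f"{field.name}: {field.type}")   # e.g. order_id: int64
```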
Classifying and Contextualizing Data
With such a high velocity and variety of data going into your data lake, how can you keep track of sensitive data and PII to preserve strong customer relationships and avoid compliance violations? For example, if a marketing team is collecting data about customers, they will likely acquire personal information, such as email addresses, phone numbers, and credit card numbers. Modern data lakes should use data classification tools to identify certain types of PII, helping database administrators to classify, control, and monitor its usage. These tools can identify where sensitive data is stored and ensure proper protection and monitoring for a growing number of data types.
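As a simplified illustration of what such classification tools do under the hood, the sketch below scans sample column values with regular expressions. The patterns are deliberately naive and purely illustrative; production-grade classifiers use far richer detection logic.

```python
# A naive PII classifier sketch; patterns are illustrative only.
import re

PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def classify_column(sample_values: list[str]) -> set[str]:
    """Return the PII categories detected in a sample of column values."""
    tags = set()
    for value in sample_values:
        for tag, pattern in PII_PATTERNS.items():
            if pattern.search(value):
                tags.add(tag)
    return tags

tags = classify_column(["jane@example.com", "call me at +1 212 555 0100"])
print(sorted(tags))   # ['email', 'phone']
```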
Some data platforms can automatically understand the context of each part of the data set, such as when it was created, when it was last modified, and how it fits within the context of your business.
This can help auditors understand which database objects contain PII. Classifying data by department or business function can also help the business allocate costs to particular departments and cost centers.
To maximize the utility of your data lake, you need to know not just where data is located and what types of sensitive data it contains but also how and when it is used and by whom. Some data platforms maintain records and metrics that reveal how broadly the data is used. This helps data stewards manage the flow of data from inception to deletion to maximize its value and minimize data management costs. For example, frequently accessed data might be maintained in a high-performance storage environment (“hot” storage), then later placed in less expensive “cold” storage for archival purposes once it is being accessed less frequently.
Ensuring Data Quality
Data governance requires oversight to maintain the quality of the data your organization shares with its constituents. Bad data can lead to missed or poor business decisions, loss of revenue, and increased costs. Data stewards — people charged with overseeing data quality — can identify when data is corrupt or inaccurate, when it’s not being refreshed often enough to be relevant, or when it’s being analyzed out of context.
Ideally, you should assign responsibility for data quality efforts to the business users who own and manage the data because they’re the people in the best position to note inaccuracies and inconsistencies. These data stewards should work closely with IT professionals and data engineers to establish data quality rules and processes with full transparency so users can see what changes are made as stewards cleanse the data.
Considering that it’s common to load raw data into a data lake, it’s important to give users the capability to ensure the data is of the quality needed for the tasks at hand. For example, analysts generating reports need very clean data, which could mean removing duplicates and ensuring there are no missing values. However, a network security analyst querying event logs may want data in its rawest form to get a granular view of potential problems in source systems. You can meet these disparate needs by curating data into logical “zones” based on how much the data has been transformed, mapped, modeled, and cleansed.
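A minimal sketch of that zoning idea, using pandas and entirely made-up data: raw files are kept untouched for analysts who need full fidelity, while a curated copy is deduplicated and stripped of incomplete records for reporting users.

```python
# Promote data from a raw zone to a curated zone; zone names, columns, and
# cleansing rules are illustrative assumptions.
import os
import pandas as pd

os.makedirs("zone_raw", exist_ok=True)
os.makedirs("zone_curated", exist_ok=True)

raw = pd.DataFrame({
    "customer_id": [101, 102, 102, 103],
    "email": ["a@x.com", "b@x.com", "b@x.com", None],
    "spend": [250.0, 80.0, 80.0, 120.0],
})

# Raw zone: keep the data exactly as it arrived, for forensic-style queries.
raw.to_csv("zone_raw/customers.csv", index=False)

# Curated zone: deduplicate and drop incomplete records for reporting users.
curated = (raw.drop_duplicates()
              .dropna(subset=["email"])
              .reset_index(drop=True))
curated.to_csv("zone_curated/customers.csv", index=False)
```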
Building Trust with Data Lineage
With many types of users accessing various logical data zones, and many data pipelines refreshing them with new or transformed data, it is easy to lose visibility into the origin of information. Tracking the data’s lineage helps users make sense of the data by revealing how it flows into the data lake, how it is transformed and manipulated, and where it goes when it flows out of the data lake.
Data lineage tools — either resident in the data platform or available through add-on services — help you understand the journey data follows through all your data-processing systems: what sources the data comes from, where it flows to, and what happens to it along the way. These technologies create a detailed map of all direct and indirect dependencies among data entities. This knowledge can help compliance officers trace the usage of sensitive data. It can also help data engineers foresee the downstream impact of changes.
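The sketch below shows the core of that idea as a dependency graph, using the networkx library (an assumption; the text names no specific tool). Node and table names are hypothetical.

```python
# Model lineage as a directed graph: nodes are data sets, edges are the
# pipeline steps that connect them.
import networkx as nx

lineage = nx.DiGraph()
lineage.add_edge("crm_export.csv", "customers_staged")
lineage.add_edge("customers_staged", "customers_curated")
lineage.add_edge("weblog_events", "customers_curated")
lineage.add_edge("customers_curated", "churn_dashboard")

# Upstream sources feeding a sensitive table (useful for a compliance review).
print(nx.ancestors(lineage, "customers_curated"))
# Downstream objects affected if the staged table changes (impact analysis).
print(nx.descendants(lineage, "customers_staged"))
```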
Simplifying the Data Lake Architecture
Proper data governance involves many complementary capabilities and tools. With each “point” solution you add to the technology stack, it becomes more challenging to aggregate all your metadata in a useful way for end users and administrators. A complete data platform should synthesize and integrate these components into one cohesive architecture. Ideally, your data lake should rest on a cloud data platform that integrates all metadata and data governance capabilities into a seamless experience.
Implementing effective governance early in the data lake development process will help you avoid potential pitfalls, such as poor access control, unacceptable data quality, and insufficient data security.
ENABLING A COMPREHENSIVE DATA STRATEGY
Founded in 1906, CEMEX is a global building materials company that offers cement, ready-mix concrete, aggregates, and urbanization solutions in growing markets around the world, powered by a multinational workforce focused on providing a superior customer experience, enabled by digital technologies. Previously, CEMEX needed a dedicated IT team in each region to manage software maintenance, dashboard updates, report requests, and month-end reporting. At the end of each month, a surge of reporting and other data-intensive workloads created performance bottlenecks.
To modernize its data management strategy, CEMEX chose a cloud data platform that stores structured and semi-structured data as the foundation for both a data lake and a data warehouse, enabling secure and governed access to all data. The new data platform powers CEMEX Go, a digital environment that automates order-to-cash workflows, supports online purchases, and tracks real-time orders in 21 countries. Each year, more than 500,000 payments and 2.5 million deliveries are completed through CEMEX Go, and nearly 90 percent of CEMEX’s customers use the environment. In the past, adding capacity required weeks of effort. Now, the cloud data platform scales automatically to meet short- and long-term needs cost-effectively, so CEMEX does not have to plan for infrastructure upgrades. Having all data in one location simplifies reporting, customer dashboards, advanced analytics, and application development. One application evaluates GPS and traffic data to determine the best routes for the company’s ready-mix concrete trucks. Another calculates the optimal distribution of trucks based on the location of ready-mix concrete plants and the forecasted demand. CEMEX only pays for the compute and storage resources each user and application consumes.
Going forward, CEMEX plans to develop machine learning applications that leverage the data lake to identify upsell and cross-sell opportunities and generate recommendations on pricing strategy, including dynamic pricing.
Chapter 5: Selecting a Modern Cloud Data Lake
IN THIS CHAPTER
Unleashing the full potential of your data
Simplifying data lake maintenance
Sharing and enriching data
Improving resiliency and business continuity
Maintaining data in multiple clouds
Optimizing time-to-value with an easy-to-use system
By leveraging the unique attributes of a modern cloud data platform, your data lake can accommodate the needs of many types of data professionals. It can consolidate data across multiple public clouds with unlimited scale, exceptional reliability, minimal maintenance, and cost-effective pricing. This chapter describes some of the essential factors to consider as you identify the right data platform for your modern cloud data lake.
Empowering Many Users, Workloads, and Tools
Whether it’s building new data applications or supporting new data science projects, a data lake must be able to keep up with the growth of your business. A modern cloud data lake should deliver all the resources you need, with instant elasticity and near-infinite scalability. You shouldn’t have to overprovision resources to meet peak demands. Storage and compute resources should be separate from one another yet logically integrated and designed to scale automatically and independently. This would allow the data lake to support a near-unlimited number of concurrent users and workloads and easily scale up and down to handle fluctuations in usage without adversely impacting performance or requiring the organization to purchase more capacity than it needs.
Of course, different data users have different language and tool preferences. For example, data analysts may prefer to work with data via SQL or a business intelligence tool, whereas a data scientist may prefer to use Python in Jupyter Notebooks. Given that a data lake is designed to be a one-stop shop for all data, it should enable many different types of data users to work productively with their data workloads.
Reducing Overhead
All modern organizations depend on data, but none want to be saddled with tedious systems management and database administration tasks. How easy is it to subscribe to the service, load your data, authorize users, and launch your most critical data workloads? After your data lake is up and running, how easy is it to provision more resources and ensure great performance?
A modern cloud data lake should enable you to leverage all your data without having to provision infrastructure or manage a complex environment. Your skilled data professionals shouldn’t have to bother with infrastructure, such as expanding storage capacity, allocating computing resources, installing security patches, and optimizing query performance. Security, tuning, and autoscalability should be built into the cloud service, freeing up your skilled data professionals to focus on gaining the most value from your data.
Offload important but avoidable administrative chores with a fully managed cloud data platform so your IT professionals can shift their attention to value-added activities, such as discovering new ways to analyze, share, and monetize data. Free up your data professionals to maximize the value of the data lake and the utility of its data.
Using All Data Types
To accommodate all possible business needs, your data lake should be versatile enough to ingest and immediately query data of many different types. That includes unstructured data, such as audio and video files, and semi-structured data, such as JSON, CSV, and XML. It should also allow you to include open source data formats, such as Apache Parquet and ORC.
The promise of the modern data lake is to enable you to seamlessly combine these many types of data so that you don’t have to develop or maintain separate silos or storage buckets. With all your diverse data sources and metadata integrated into a single system, users can easily put that data to work and obtain data-driven insights.
But look a little deeper: Your data lake vendor may claim to “support” multiple data types, but how easy is it to synthesize them? For example, if you’re flowing structured relational data from a CRM application into your data lake, how easy is it to combine these CRM records with semi-structured JSON data from an ecommerce weblog? If you’re creating a machine learning (ML) model that monitors purchasing trends and predicts buying activity, can you dictate a schema for the JSON data that models these diverse data sources? Can you integrate a function for processing image files, say to pull in pictures from a product catalog? Do you have to figure out how to extract information from those images, or can you simply embed that function into a SQL query?
A good data lake stores diverse types of data in their native formats without creating new data silos and imposes a schema to streamline access to all your data. You don’t have to develop or maintain separate storage environments for structured, semi-structured, and unstructured data. It is easy to load, combine, and analyze all data through a single interface while maintaining transactional integrity.
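A small pandas sketch of the CRM-plus-weblog scenario mentioned above; all field names are hypothetical. The point is that flattening the semi-structured events and joining them to the relational records should be this routine, not a separate engineering project.

```python
# Join structured CRM records with semi-structured JSON weblog events.
import pandas as pd

crm = pd.DataFrame({
    "customer_id": [101, 102],
    "segment": ["enterprise", "smb"],
})

weblog_events = [
    {"customer": {"id": 101}, "event": "view", "sku": "A-10"},
    {"customer": {"id": 102}, "event": "purchase", "sku": "B-22"},
]

# Flatten the nested JSON into columns, then join on the shared key.
events = pd.json_normalize(weblog_events)   # columns: event, sku, customer.id
combined = events.merge(crm, left_on="customer.id", right_on="customer_id")
print(combined[["customer_id", "segment", "event", "sku"]])
```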
Here are some guidelines for smooth data management:
Establish a complete metadata layer to guide user analytics.
Standardize on an architecture that supports JSON, Avro, Parquet, and XML data, along with leading Open Table formats, such as Apache Iceberg, as needed.
Use data pipeline tools that allow for native data loading with transactional integrity.
Capturing Data of Various Latencies
Your data platform should include data pipeline tools to migrate data into your cloud data lake. Bulk-load processes work best for initial transfers, especially if you have many terabytes of data to load into the data lake. After that, you’ll most likely want to integrate only the changes that have occurred since the last data load, a processing technique known as change data capture (CDC).
Increasingly, real-time and near real-time data feeds are used for streaming data processes that load data continuously. These processes are designed to capture IoT data, weblog data, and other continuous sources emitted by mechanical equipment, environmental sensors, and digital devices, such as computer hardware and mobile phones. Distributed publishing/subscribing messaging services represent a popular way to send and receive streaming data. These services act as publishers and receivers to ensure that data is received by the subscribers. Examples include open source technologies such as Apache Kafka and commercial technologies such as Amazon Kinesis, Microsoft Event Hubs, Google Cloud Pub/Sub, and Snowflake Snowpipe. Open source tools offer low-cost solutions, but they generally require more setup, tuning, and management than commercial tools.
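For a sense of what continuous loading looks like in practice, here is a minimal consumer sketch using the open source kafka-python client; the topic name, broker address, batch size, and landing-file path are all assumptions, and a real pipeline would stage micro-batches into cloud object storage or a continuous-load service rather than a local file.

```python
# Continuously read events from a Kafka topic and land them in micro-batches.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "clickstream-events",                 # hypothetical topic
    bootstrap_servers="localhost:9092",   # assumed broker address
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    auto_offset_reset="earliest",
    group_id="data-lake-loader",
)

batch = []
for message in consumer:
    batch.append(message.value)
    if len(batch) >= 1000:                # micro-batch threshold
        with open("landing_batch.json", "w") as f:
            json.dump(batch, f)           # stand-in for staging to the lake
        batch.clear()
```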
Ensure your data pipelines can move data continuously and in batch mode. They must also easily support schema-on-read and handle the complex transformations required to rationalize different data types without reducing the performance of production workloads or hindering user productivity.
Sharing and Enriching Data
A modern data lake should not only simplify the process of storing, transforming, integrating, managing, and analyzing all types of data. It should also streamline how diverse teams share data, so they can collaborate on a common data set without having to maintain multiple copies of data or move it from place to place.
Modern data sharing enables any organization to share and receive live data, within minutes, in a governed and secure way — with almost none of the risk, cost, headache, and delay that continue to plague traditional data-sharing methods. It permits organizations to share the data itself and services that can be applied to that data, such as data modeling services, data enrichment services, and even complete data applications.
Traditional data lakes aren’t capable of modern data sharing. These older architectures use file transfer protocol (FTP), application programming interfaces (APIs), email, and other repetitive methods to duplicate static data and make it available to consumers. Lack of security and governance prohibits these older data lakes from enabling unlimited, concurrent access by data consumers. They also produce static data that quickly becomes dated and must be refreshed with up-to-date versions. That means constant data movement and management.
Modern cloud data platforms enable you to easily share the data in your data lake and receive shared data across your business or with organizations external to your own in a secure and governed way — without moving it from place to place.
Your platform should also facilitate data sharing on a commercial scale by permitting organizations to tap into third-party data repositories, services, and streams. Using these modern data-sharing methods, organizations can share data with vendors, supply chain partners, logistics partners, customers, and many other constituents. They can also set up data-sharing services that turn their data lakes into profit centers. A multi-tenant architecture allows authorized members of the ecosystem to tap into live, read-only versions of data and data functions within the data lake, and this ready-to-use data is immediately available all the time.
Look beyond the first-party data you own and consider second- and third-party data to improve your ML models and discover previously unknown patterns, correlations, and insights. Acquiring this external data can allow you to gain deeper insights, streamline operations, better serve customers, and discover new revenue streams based on data.
ELEVATING DATA ANALYTICS
Identity company Okta helps organizations securely connect people and technology. Thousands of organizations, including JetBlue and Slack, use the Okta Identity Cloud to manage access and authentication for employees, contractors, partners, and customers. To enable data-driven decision-making across the company, Okta ingests and analyzes large amounts of product configuration and usage data. However, Okta’s previous legacy cloud data architecture could not affordably scale to handle up to 500 million events per day from the Okta Identity Cloud. Resource contention led to multi-day data processing delays. Basic event stream queries took minutes to finish running, which negatively impacted the productivity of data analysts. Large, month-end processes took up to nine hours to complete, and Okta could not surface the insights that people were asking for.
Realizing the need for a modern data environment, Okta created a data lake on a cloud data platform, in conjunction with external storage on AWS. Today, ingesting data from numerous sources into the data lake provides Okta with a single source of truth for BI reporting and ad hoc analytics. The new data platform separates compute from storage resources, which has allowed Okta to near-instantly and near-infinitely scale resources to support more business units and create more reports.
For example, Okta’s finance team can gather key metrics such as total customer count and net retention rate. Marketers have a unified view of advertising performance and attribution across all platforms, including Google, LinkedIn, and Facebook. Product teams monitor configuration and usage data to measure feature adoption and guide development decisions. And by combining product configuration and usage data with CRM data, Okta surfaces pipeline opportunities that have resulted in millions of dollars in revenue. Okta’s Director of Data and Analytics describes the platform as “a central nervous system that enables data sharing and self-service analytics.”
Improving Resilience for Business Continuity
Even the most robust information systems can fail. In some cases, floods, fires, earthquakes, and other natural disasters can wipe out entire data centers. In other cases, cyberattacks can result in data loss, data inconsistencies, and data corruption. And don’t forget the fallibility of internal personnel, such as when a database administrator inadvertently deletes a table from a database.
None of these crises or mishaps will cause lasting damage if your data lake architecture incorporates redundant processes and procedures to keep all data online, instantly available, and well protected so your critical workloads don’t experience downtime.
To establish a workable strategy for data protection and business continuity, start by identifying the business impact of an outage for various workloads. From there, establish service level agreements (SLAs) that dictate your tolerance for downtime. What happens when a daily sales report is delayed? What is the downstream impact if an inventory dashboard isn’t refreshed for several hours? Which databases are used by client-facing applications that drive revenue or customer experience? Answering these questions will help you understand user expectations and use those expectations to establish guidelines for data backups, data replication, data instance failover, and disaster recovery to ensure business continuity.
If avoiding downtime is critical to your operation, ensure your data lake provider uses data replication techniques, disaster recovery procedures, and instant failover technologies to insulate your operation from all these incidents. All cloud data lakes should protect data and ensure business continuity by performing periodic backups. Suppose a particular region of a public cloud provider experiences an outage, or even all its regions experience one. In that case, the analytic operations and applications that need that data should automatically switch to a redundant copy of that data within seconds in another region or on another public cloud provider. Data retention requirements call for maintaining copies of all your data. It’s important to replicate that data among multiple, geographically dispersed locations to offer the best possible data protection. The “triple redundancy” offered by some cloud vendors won’t do you any good if all three copies of your data are in the same cloud region when an unforeseen disaster strikes.
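Client-side failover can be as simple as the sketch below, which tries a primary regional endpoint and falls back to a replica; the hostnames are invented, and a managed platform would normally handle this redirection for you.

```python
# Pick the first reachable regional endpoint before opening a session.
import socket

ENDPOINTS = [
    ("primary.us-east.example.com", 443),   # hypothetical primary region
    ("replica.eu-west.example.com", 443),   # hypothetical replicated region
]

def pick_endpoint(timeout: float = 2.0) -> tuple[str, int]:
    """Return the first endpoint that accepts a TCP connection."""
    for host, port in ENDPOINTS:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return host, port
        except OSError:
            continue   # region unreachable; try the next replica
    raise RuntimeError("no region available")
```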
A complete data protection strategy also considers regulatory compliance and certification requirements, which may stipulate that data be retained for a certain length of time for legal and auditing purposes.
Finally, pay attention to performance. Data backup and replication procedures are important, but if you don’t have the right technology, these tasks can consume valuable compute resources and interfere with production analytic workloads. To ensure the durability, resiliency, and availability of your data, a modern cloud data lake should manage replication programmatically in the background without interfering with whatever workloads are executing at the time. Good data backup, protection, and replication procedures minimize, if not prevent, performance degradation and data availability interruptions.
Errors can come from many places, including human error, accidental deletions, and cloud infrastructure failure. You must be prepared for all of these. As explained in the next section, the data in your data lake, along with all the metadata, should be periodically replicated to multiple clouds and regions in conjunction with disaster recovery procedures that allow your data-driven operations to quickly failover to a replicated instance.
Supporting Multiple Clouds
A complete data protection strategy should go beyond merely duplicating data within the same physical region or zone of a cloud computing and storage provider. The data platform provider should be able to quickly shift all your production workloads from one region to a different region and ideally from one cloud provider to another cloud provider to uphold your SLAs.
Some data lake services are multicloud, meaning they can run on more than one major public cloud. Although this capability maximizes flexibility, it also propagates silos and negates the fundamental principle of centralizing all data in a data lake. If you transition a workload to a different region or cloud, will the data pipelines remain intact? Will all data security procedures and data governance policies be enforced?
Although multicloud capabilities can be useful, cross-cloud capabilities are superior. This not only means a system can run on any cloud but that it can also store and use data services among various clouds. For example, it can store data on one cloud and process it on another. This superior architecture enables you to leverage investments in Amazon Web Services, Microsoft Azure, or Google Cloud Platform and bring them all together in a cohesive way — or seamlessly transition from one to another.
A modern cloud data lake should allow you to compile queries and coordinate database transactions across multiple regions and clouds, wherever data and users reside. It should also maintain transactional integrity for all data in any cloud worldwide. A common metadata layer should enforce consistency, even when data is stored in multiple clouds and across multiple regions.
Accommodating New Storage Paradigms
New types of data and new data storage paradigms are constantly appearing. For example, such table formats as Apache Iceberg are popular because they add a SQL-like table structure and ACID transactions to the unstructured and semi-structured data stored in files and documents. This allows computing engines, such as Spark, Trino, PrestoDB, Flink, Hive, Amazon EMR, and Snowflake, to easily manage and inspect the data. These newer data formats have tremendous momentum from the commercial and open source communities. Will your data platform support them if needed?
Whenever you adopt a storage paradigm or computing engine, opt for interoperability without compromising ease-of-use, enabling your technology professionals to work with their tools of choice, both now and in the future. For example, if you are currently using Avro files, you shouldn’t be forced to change to Parquet merely because your computing engine requires that storage format. Opt for a solution that works with multiple storage paradigms, data formats, and computing engines as necessary for your business.
Paying for What You Use
Your cloud data lake should offer a consumption-based pricing model. Each user and workgroup should pay only for the precise storage and compute resources used in per-second increments, so you never have to pay for idle capacity.
Contrast this pricing strategy with a subscription model, which requires customers to pay a recurring price for a set number of licenses or seats to use the SaaS provider’s software. Although subscriptions work well for ensuring predictable revenue for SaaS vendors, this model can be challenging for customers. They must estimate upfront how many licenses they may need and pay a monthly fee without any guarantee of using all the licenses or features they contracted for.
To maximize your budget, ask yourself these questions: Are you paying for unused capacity? Have you purchased more storage and computing licenses than what’s necessary? Are you oversubscribing to these resources to accommodate occasional but predictable surges in demand?
Work with a cloud data lake vendor that offers consumption-based pricing so you can “pay as you go” for resources actually consumed. You shouldn’t agree to any multiyear licenses or service contracts, although you may get a better rate by committing to a minimum volume of usage.
To keep costs under control:
Pay for usage by the second, not by the minute or month or a time frame affected by the busiest day of the year.
Automatically increase and decrease data lake resources for daily, monthly, quarterly, or seasonal data surges.
Eliminate onerous capacity-planning exercises by easily assessing your day-to-day requirements.
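The difference between the two models is easy to see with a back-of-the-envelope calculation; every figure below is a made-up assumption, so substitute your own rates and usage.

```python
# Compare paying per second of actual use with paying for always-on capacity.
SECONDS_PER_HOUR = 3600

rate_per_second = 0.0008          # assumed cost of one compute-cluster second
busy_hours_per_month = 120        # clusters auto-suspend the rest of the time
licensed_hours_per_month = 720    # subscription capacity billed around the clock

consumption_cost = rate_per_second * busy_hours_per_month * SECONDS_PER_HOUR
subscription_cost = rate_per_second * licensed_hours_per_month * SECONDS_PER_HOUR

print(f"consumption-based:        ${consumption_cost:,.2f}/month")   # $345.60
print(f"subscription-equivalent:  ${subscription_cost:,.2f}/month")  # $2,073.60
```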
Chapter 6: Six Steps for Planning Your Cloud Data Lake
IN THIS CHAPTER
Evaluating your needs
Migrating data and workloads
Establishing success criteria
Setting up a proof of concept
Quantifying value
Deploying a modern data lake requires careful planning and assessment of your current and future needs. Follow the steps in this chapter to get started.
Step 1: Review Requirements
Your data lake should allow you to store data in raw form, enable immediate exploration of that data, refine it in a consistent and managed way, and power a broad range of data-driven workloads. Consider these factors:
Data: Identify the sources, types, and locations of the data you plan to load into your data lake. Will you gather new data? Will you stage data from an existing data warehouse or data store? Consider not only the data you have now but how other types of data could improve your operations, such as by powering new predictive models.
Users: Determine who will be authorized to access the data, develop data-driven applications, and create new insights. Compare the skills possessed by your current team against your plans for the business. If you plan to democratize access to business users, what tools or techniques will make the data accessible to them?
Access: Do your business intelligence and data science tools use industry-standard interfaces and allow data professionals to work with popular frameworks and languages, such as Structured Query Language (SQL), Python, Java, and R? In particular, ensure your new data platform is ANSI-SQL compliant so you can discover value hidden within the data lake, and quickly deliver data-driven insights to all your business users.
Sharing: Do you plan to share data across your organization or externally with customers or partners? If so, what types of data will you share, and will you use a data marketplace to monetize data? Identify archaic data sharing methods such as FTP and email, and consider how you can replace them with a modern data-sharing architecture.
Stewardship: Determine who will be responsible for data quality, data governance, and data security, both for your initial data loads and continuously as new data is ingested.
Step 2: Migrate or Start Fresh
You may have an existing cloud data warehouse that you want to extend with new data types. Or perhaps you have an on-premises data lake created in Hadoop and use a cloud object store for additional data and files. Do you want to create a new data lake from scratch, using object storage from a general-purpose cloud provider or add to an existing object store? Do you have historical data sets you would like to migrate? If so, you will probably want to set up a one-time bulk transfer of this historical information to the data lake, then establish a pipeline to stream data, continuously or periodically, as your websites, IoT devices, data applications, and other apps generate new data.
Step 3: Establish Success Criteria
Identify important business and technical criteria, focusing on performance, concurrency, simplicity, and total cost of ownership (TCO). You should not have to install, configure, or maintain hardware and software. Backups, performance tuning, security updates, and other management tasks should be handled by the cloud solution provider. How will you define success once the data lake is in full production mode? Will your data-driven applications impact revenue? Do you plan to monetize data in your data lake?
Step 4: Evaluate Solutions
This book outlines the attributes you should be looking for in a modern cloud data lake. Popular choices include the following:
Do-it-yourself open source platforms, such as Hadoop, Spark, Presto, and Hudi, which offer great flexibility and scalability, yet typically require complex infrastructure, custom coding, skilled engineering, and extensive system management
Object storage environments that use the near-boundless storage and compute services from Amazon, Google, Microsoft, and other vendors, on top of which you must develop and maintain your data lake environment
Specialized cloud data platform solutions optimized for storing, analyzing, and sharing large and diverse volumes of data, driven by a common layer of services that simplify security, governance, and metadata management
Whatever type of cloud data platform you choose, ensure it can easily integrate many types of data in one universal repository to avoid creating data silos. If you plan to store data in a public object store, opt for a data platform that can accommodate these storage environments without the need to lift, shift, or copy data. Finally, the solution should support your existing skills, tools, and expertise and offer robust security and governance capabilities.
Step 5: Set Up a Proof of Concept
A proof of concept (POC) tests a solution to determine how well it serves your needs and meets your success criteria. When setting up your POC, list all requirements and success criteria — not just the issues you’re trying to resolve, but everything possible with a cloud solution. Ensure the new data lake overcomes the drawbacks of your current data management and analytic systems, such as making it easy to combine structured, semi-structured, and unstructured data. Can the storage layer accommodate multiple file formats in an efficient, cost-effective way? If you plan to deploy predictive use cases, does the platform make it easy to develop and apply data science models?
Step 6: Quantify Value
One platform is easier to manage than several, so consider the degree to which your new data platform can eliminate data silos and minimize your reliance on multiple solutions. Pay close attention to the services offered by the data platform vendor. Will the vendor handle data lake administration, security, management, and maintenance? If so, will you need fewer technology professionals than you did in the past? How will your month-to-month cloud usage fees compare to what you might have spent previously for on-premises software licenses and maintenance contracts?
Assuming you outsource everything to the vendor, you can calculate the TCO based on the monthly subscription fee or incremental usage fees. If you opt for an infrastructure-as-a-service (IaaS) or platform-as-a-service (PaaS) solution, you need to add the costs of whatever software, administration, and services the solution doesn’t include. If you instead choose a cloud data platform, how much of this will the vendor manage for you? And don’t overlook the savings possible when a cloud solution is scaled up and down dynamically in response to changing demand and when the vendor only charges by the second.
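A simple way to frame that comparison is to total each side's recurring costs, as in the illustrative sketch below; all figures are assumptions you would replace with your own estimates.

```python
# An illustrative TCO comparison; every number is a placeholder assumption.
cloud = {
    "metered_usage_fees": 12_000 * 12,   # compute + storage per year
    "admin_staff": 0.25 * 120_000,       # fraction of one engineer for oversight
}

on_prem = {
    "licenses_and_maintenance": 90_000,
    "hardware_amortized": 40_000,
    "admin_staff": 2 * 120_000,          # DBAs, patching, tuning, backups
}

print("cloud TCO per year:  ", sum(cloud.values()))     # 174000.0
print("on-prem TCO per year:", sum(on_prem.values()))   # 370000
```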
0 notes
Text
don’t swim too far down, lest you get stuck and drown.
(a/n): Just a Chris McLean-centric piece I decided to write, because I have no self control. Also, constructive criticism is always appreciated! I haven’t written for these characters before, so let me know if any of them feel OOC at all.
Word Count: 1,365
Summary: It was only for a moment, but in that moment, he was at a loss.
The boat creaked, rocking gently with the flow of the waves. Chris leaned back in his chair, feet propped up on the railing of the small sailboat, flipping idly through a choose your own adventure novel he had brought along with him.
It was quiet, save for the occasional seagull flying by, and the faint chatter coming from below deck.
He sighed, barely processing the words written out on the page, a restless air surrounding him. He didn't want to go out today, but three phone calls from Blaineley pressuring him to go on a boat ride was enough to force him to leave the solitude of his house. Really, he had been half tempted to bring work with him, but decided against it, knowing full well his weeks and weeks of notes would be tossed in the ocean by her.
“Don’t look so depressed,” Blaineley’s voice sounded from beside him, “You’ll get even more wrinkles.”
Chris tilted his head back to look up at her, eyebrow raised at her sudden remark. One hand rested on her hip, while the other held a cocktail. Instead of her usual red dress, she wore more casual attire; a bright yellow tank top and jean shorts.
“Did you just call me old?” he glared at her, but there was no bite to his words.
She smiled teasingly, taking a sip of her drink, “Hm… did I?”
The TV host scoffed, marking the page of his book before closing it and setting it off to the side, attention focusing out to the very distant city. His thoughts wandered back to his house, back to all the notes littering his bedroom floor in a semi-organized manner. He should be working.
“You know, Chris,” Blaineley spoke up, almost absentmindedly, “It’s okay to take a break every once in a while.”
“Says the woman who does all her work at the very last minute,” Don shot back as he approached the two, a grin on his face. Like Blaineley, he wore casual attire; a short-sleeved, mint green shirt and cargo shorts.
“Hey, not all the time!” Blaineley argued, plopping down in the vacant chair beside Chris.
“Oh, sorry,” Don corrected, leaning back against the railing, grinning cheekily, “Most of the time.”
She huffed, sticking her tongue out as a retaliation.
Chris rolled his eyes at the two, a faint smile on his face, “Anyways,” he cut in, turning his focus to Don, “How was your little show? What was the name of it again?” he feigned ignorance, squinting and staring off into the distance as if he was trying to remember it, “Started with an R… or was it an I…?”
Don crossed his arms, glaring at his friend, “The Ridonculous Race,” he began, “Went absolutely outstanding.”
“Oh my god,” Blaineley sighed, “Can we talk about that final? What the hell even was that?”
Don and Chris exchanged a quick glance before turning their focus to the blond.
“Are you really that upset Geoff and Brody didn’t win?” Don asked, eyebrow raised.
“It wasn’t even about Geoff and his surfer buddy,” she explained, sitting back in her chair and rubbing her temple to fend off an oncoming headache, “It was those damn ice dancers!”
“Oh, don’t even get me started on those two!” Don groaned in annoyance, “They were the absolute worst!”
Blaineley swirled the liquid in her glass absentmindedly, “They were even worse than Heather and Alejandro combined!”
“I don’t know about that, dude,” Chris replied, a smug grin on his face, “Those two were pretty brutal during the competition,” he couldn’t help but laugh at the memories, “Man, watching those two tear up the other contestants was amazing.”
His face fell into a tight frown at the reminder of his show. He should be working, his thoughts bitterly reminded him. He should be back at his house working on ideas, scheduling meetings with the producers, reviewing the various tapes that contestants submitted. He shouldn’t be wasting time on a boat with the two people he hated the least in the world.
“Christopher,” Blaineley snapped, giving him a somewhat gentle nudge in the leg to gain his attention, “You could at least pretend you enjoy our company, you know,” her tone was light, almost playful.
Chris grimaced at the use of his full name, “It’s hard to enjoy anything when I have work I need to do,” he replied bitterly.
Blaineley squinted at him, a confused look crossing her face, “Since when have you ever cared about work this much?”
“Since—” Chris felt his throat tighten at the reminder, and for a second, just one short second that filled the dead air around them, he felt a very familiar hint of fear take hold.
“Chris…?” Don’s voice was so full of concern, it made Chris wonder why they were friends.
Carefully, the TV host stood from his seat, “Look,” he sighed, starting to pace around the boat in an attempt to keep his emotions in check, “I don’t expect either of you to get it,” he couldn’t help sounding resentful, even just for a moment, “But my—my career is on the line, here,” he combed a hand through his hair, unable to mask the exhausted tilt to his voice, “I started paying more attention to the views my show was getting and they—” he made various gestures with his hands as he spoke, as if to emphasize his point, “God, they were dropping. Like, comparing Total Drama Island with Pahkitew Island, it was—it was insane just how big of a difference there was!”
Blaineley and Don watched him carefully, waiting for Chris to finish his rant, silently wondering what they could say that might ease his worries, even just a little bit. Then again, perhaps they shouldn’t say anything at all and let the host wallow in his fears.
“I finally get picked up for two more seasons after seven years,” he turned to look at Blaineley and Don, eyes desperate for something he couldn’t quite name, “I can’t mess this up again. I need to get the views back, or else my life, everything I’ve worked so hard to build, is going to be gone,” his voice suddenly grew a little distant as his gaze fell onto the city, so far away, he couldn't even touch it, “I’ll be… gone…”
There was a long silence as the words hung in the air, deafening. Finally, after a moment, Blaineley was the first one to speak.
“God, you’re stupid,” she sighed, finishing off the rest of her drink.
Chris finally slowed his pacing, glaring at her in annoyance. Don found himself laughing at the sudden statement.
“Chris, even if you lose all your views and the show does end up being cancelled,” she stood from her chair, walking over to her friend and draping an arm over his shoulder, “You’re still Chris McLean, the Host with the Most, the guy everyone wants and wants to be,” she grinned, “So…” she flicked him in the nose, earning a sharp wince, “Stop worrying, would ya?”
He absentmindedly rubbed his sore nose, staring at Blaineley in some surprise. Chris hadn’t expected her to try and comfort him or to even bother listening to his mild tangent, and yet…
There was a distinct weight being added on his shoulder. He turned, meeting Don’s wide grin.
“She’s right, ya know,” he added, giving a light shrug, “And besides, it’s not like we don’t have sources for you to use whenever your show does get canceled.”
Chris scoffed, a faint smile on his face, “Not sure how I feel about that confidence.”
“I bet it gets trashed before season seven,” Blaineley chimed in, a smirk playing on her face.
“Oh, wow,” Chris crossed his arms, shaking his head in fake hurt, “Here I thought you were supposed to be my friends!”
And in that moment, the three began to laugh and a calm understanding settled over them. Chris still had work to do, he still had challenges to work on and video tapes to review, but it could wait. He still had tomorrow, after all.
For now, he would enjoy himself and take a much needed break.
46 notes · View notes
lsanchor · 6 years
Text
34 notes · View notes
thespiderdoctor · 6 years
Text
Falling into stalingrad
Woodie, Webber, Wilson: come fly with us!
Wendy: come die with us!:D
1 note · View note
officialtele2n · 1 year
Text
FACTS ABOUT THE PARTICIPANTS
The reason Alex is blind and wears a blindfold is that he was born with anophthalmia, which means he was born without both eyes.
Waylan was based off of Ellis from L4D2.
Jasper has replayed the Danganronpa series time and time again. He mainly kins Enoshima, but also kins Tsumiki, Ouma, and Komaeda.
1 note · View note
recipesotd-blog · 7 years
Text
TDWI Business Analytics, Data Science, and Data Management Training Schedule Announced for 2018
0 notes
Text
New TDWI Research Report Explores Artificial Intelligence and Advanced Analytics
Report reveals how advances in hardware and new algorithmic developments are leading organizations to refresh their views on classic computational ideas. /EIN News/ — SEATTLE, WA, Oct. 30, 2017 (GLOBE NEWSWIRE) — TDWI Research has released its newest Best Practices Report, Advanced Analytics: Moving Toward AI, Machine Learning, and Nat…
0 notes
gluedata01 · 2 years
Text
Guide for Beginners To Understand Master Data Management (MDM)
Master Data Management is an information management discipline that offers opportunities for data quality management experts and master data governance experts. Master Data Management requires creating strategies for data quality management and selecting an appropriate toolset.
Let's understand what MDM is. Experts define MDM in various ways, and it is up to you which definition resonates most. Here are some prominent definitions:

1. "MDM is the practice of defining and maintaining consistent definitions of business entities, then sharing them via integration techniques across multiple IT systems within an enterprise and sometimes beyond to partnering companies or customers" — Philip Russom, PhD, Industry Analyst, TDWI

2. "MDM is a technology-enabled discipline in which business and IT work together to ensure the uniformity, accuracy, stewardship, semantic consistency and accountability of the enterprise's official shared master data assets. Master data is the consistent and uniform set of identifiers and extended attributes that describes the core entities of the enterprise including customers, prospects, citizens, suppliers, sites, hierarchies and chart of accounts." — Gartner

Why is Master Data Management becoming sought after? The following points explain the reasons behind MDM's popularity.

1. The Impact of MDM Issues on the Business - Master data is among the most important data an organization holds, so it is imperative that problems of the past be fixed; minor errors in master data can lead to viral problems when propagated across a federated network. In the last decade, enterprise MDM has gained significant recognition for its ability to differentiate businesses.

2. Escalating Complexity and Globalization - A Master Data Management system goes right to the point of why Information Development is necessary. Organizations are becoming more and more federated, with more information and integration on a global scale than ever before. A successful approach relies on reducing complexity. From a data management perspective, globalization created a variety of additional challenges, including issues relating to multilingual and multi-character sets, as well as a requirement for 24x7 data availability from global operations.

3. All Sides See a Major Opportunity - System integrators and product vendors can take advantage of MDM because it is a big, complex problem. Data hubs, which are part of MDM, have been developed as a new MDM technology. Each information management vendor has a strategy for solving the problem, and application-centric vendors (which started the MDM trend) see this as a new area of opportunity for integration and applications. Organizations with MDM issues are taking a similar approach: confronting a variety of information management challenges gives them a clear framework for framing the issue.

What challenges does Master Data Management present? The following key points summarize them.

1. It is often difficult for people to decide where to start, prioritize, and focus.
2. A lack of information governance (stewardship, ownership, policies) leads to high levels of complexity across an organization.
3. The problem of domain values is especially challenging when they are stored across a number of different systems, especially product information.
4. In many large organizations, customer data is stored in multiple systems across the enterprise, resulting in a high degree of overlap in the master data.
5. Many organizations have complex data quality issues related to master data, particularly for customer and address data from legacy systems.

About GlueData: GlueData is an independently owned and managed company.
The company is a global data and analytics consultancy that assists SAP clients in mastering their data. It provides data management tools, data migration services, and master data management services with the help of its data migration consultants and experts. Its SimpleData Management methodology aims to reduce the complexity of data governance by focusing on what is most important from a data domain or objective point of view.
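To make the duplicate-customer challenge above concrete, here is a deliberately naive matching sketch in Python; the records, fields, and normalization rules are invented for illustration, and real MDM tools use far more sophisticated matching and survivorship logic.

```python
# Merge customer records from two systems into a single "golden" view.
import re

def match_key(record: dict) -> tuple:
    """Build a simple match key from a normalized name and email."""
    name = re.sub(r"\s+", " ", record["name"].strip().lower())
    email = record["email"].strip().lower()
    return (name, email)

crm = [{"name": "Jane  Doe", "email": "JANE@EXAMPLE.COM", "source": "CRM"}]
erp = [{"name": "jane doe", "email": "jane@example.com", "source": "ERP"}]

golden = {}
for record in crm + erp:
    golden.setdefault(match_key(record), []).append(record)

for key, records in golden.items():
    print(key, "->", [r["source"] for r in records])   # one key, both sources
```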
0 notes
etl-testing-tools · 3 years
Text
Data Validation Testing
At a recent TDWI virtual summit on “Data Integration and Data Quality”, I attended a session titled “Continuous Data Validation: Five Best Practices” by Andrew Cardno.
In this session, Andrew Cardno, one of the adjunct faculty at TDWI, talked about the importance of validating data from the whole to the part, which means that the metrics or totals should be validated before reconciling the detailed data or drill-downs. For example, revenue totals by product type should be the same in the Finance, CRM, and Reporting systems.
Attending this talk reminded me of a Data Warehouse project I worked on at one of the federal agencies. The source system was a Case Management system with a Data Warehouse for reporting. We noticed that one of the key metrics, "Number of Cases by Case Type," yielded different results when queried on the source database, the data warehouse, and the reports. Such discrepancies undermine trust in the reports and the underlying data. The reason for a mismatch can be an unwanted filter, a wrong join, or an error during the ETL process.
In the federal agency's case, this report is sent to Congress, and there is a congressional mandate to ensure that the numbers are correct. In other industries, such as healthcare and financial services, compliance requirements demand that data be consistent across multiple systems in the enterprise. It is essential to reconcile the metrics and the underlying data across these systems.
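A minimal sketch of that "whole to part" check: compare an aggregate metric in the source system against the warehouse before anyone drills into detail rows. The table, columns, and sample rows are hypothetical, and sqlite3 stands in for both systems purely so the example runs anywhere.

```python
# Reconcile "Number of Cases by Case Type" between source and warehouse.
import sqlite3

def case_counts(conn) -> dict:
    sql = "SELECT case_type, COUNT(*) FROM cases GROUP BY case_type"
    return dict(conn.execute(sql).fetchall())

source = sqlite3.connect(":memory:")
warehouse = sqlite3.connect(":memory:")
for db in (source, warehouse):
    db.execute("CREATE TABLE cases (case_id INTEGER, case_type TEXT)")

source.executemany("INSERT INTO cases VALUES (?, ?)",
                   [(1, "civil"), (2, "civil"), (3, "criminal")])
warehouse.executemany("INSERT INTO cases VALUES (?, ?)",
                      [(1, "civil"), (2, "civil")])   # one row lost in ETL

src, wh = case_counts(source), case_counts(warehouse)
mismatches = {k: (src.get(k), wh.get(k))
              for k in set(src) | set(wh) if src.get(k) != wh.get(k)}
print(mismatches)   # {'criminal': (1, None)} -- flag before the reports go out
```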
Andrew describes two primary techniques for performing data validation testing that help instill trust in the data and analytics:
Glassbox Data Validation Testing
Blackbox Data Validation Testing
I will go over these Data Validation testing techniques in more detail below and explain how the Datagaps DataOps suite can help automate Data Validation testing.
0 notes
Why are Hadoop Services in New York the most effective career move?
Bilytica is a #1 provider of Hadoop Services in New York. Data is everywhere, and there is an urgent need to collect and preserve whatever data is being generated, for fear of missing out on something important. There is an enormous amount of data floating around; what we do with it is all that matters right now. This is why Big Data Analytics is at the frontier of IT.
Bilytica #1 Hadoop Services in New York
Hadoop Services in New York have become crucial because they help improve business decision-making and provide a significant edge over competitors. This applies to organizations as well as professionals in the analytics domain. For professionals skilled in Big Data and Hadoop, there is an ocean of opportunities out there.
Why Big Data Analytics is the Best Career Move
If you're still not convinced that Big Data Analytics is one of the most sought-after skills, here are 6 more reasons to help you see the big picture.
Soaring Demand for Analytics Professionals:
There are more job opportunities in Big Data management and analytics than there were last year, and many IT professionals are prepared to invest time and money in the training. The job trend graph for Big Data Analytics from Indeed.com shows a growing trend and, as a result, a steady increase in the number of job opportunities.
Huge Job Opportunities & Meeting the Skill Gap:
The demand for Hadoop Services in New York rises steadily, but there is an enormous deficit on the supply side. This is happening globally and isn't restricted to any particular geography. In spite of Big Data Analytics being a 'hot' job, there is still a large number of unfilled positions across the world due to a shortage of the required skills. A McKinsey Global Institute study states that the US will face a shortage of about 190,000 data scientists and 1.5 million managers and analysts who can understand and make decisions using Big Data by 2018. To get in-depth knowledge of data science, you can enroll in live Data Science Certification Training by Edureka with 24/7 support and lifetime access. India currently has the highest concentration of analytics professionals globally. In spite of this, the scarcity of data analytics talent is especially acute, and demand for talent is expected to rise as more global organizations outsource their work.
Big Data Analytics: A Top Priority in a Lot of Organizations
According to the 'Peer Research – Big Data Analytics' survey, Big Data Analytics is one of the top priorities of the organizations participating in the survey, as they believe it improves their performance.
Adoption of Big Data Analytics is Growing:
New technologies are now making it easier to perform increasingly sophisticated data analytics on very large and diverse datasets. This is evident in a report from The Data Warehousing Institute (TDWI). According to this report, more than a third of the respondents are currently using some form of advanced analytics on Big Data for Business Intelligence, Predictive Analytics, and data mining tasks. With Big Data Analytics providing an edge over the competition, the pace of implementation of the required analytics tools has increased exponentially. In fact, most of the respondents of the 'Peer Research – Big Data Analytics' survey reported that they already have a strategy set up for handling Big Data Analytics, and those who do not yet have a strategy are in the process of planning for it.
Analytics: A Key Factor in Decision-Making
Big Data Analytics is a key competitive resource for many companies; there is little question about that. According to the 'Analytics Advantage' survey overseen by Tom Davenport, 96 percent of respondents feel that analytics will become more important to their organizations within the next three years. This is because there is an enormous amount of data that is not being used, and at present only rudimentary analytics is being done. About 49 percent of the respondents strongly believe that analytics is a key factor in better decision-making capabilities. Another 16 percent value it for supporting key strategic initiatives.
MS Power BI services in Pakistan are a key factor in providing scorecards and insights for the different departments of an organization, which rely on Power BI services in Lahore, Karachi, and Islamabad and on insights developed by Power BI developers in Pakistan.
Businesses in Pakistan often look for the best Power BI services through official Microsoft partners, known as Power BI Partners in Lahore, Karachi, and Islamabad, to ensure that the best support is provided for their projects under a certified Power BI Partner in Pakistan.
Microsoft is a leading global company that provides business intelligence solutions through Power BI services in Pakistan.
Companies depend on experienced Power BI consultants in Pakistan, also known as Power BI partners, to build their data warehouse and data integration layer for data modeling using Power BI solutions.
Services We Offer:
Strategy
Competitive Intelligence
Marketing Analytics
Sales Analytics
Data Monetization
Predictive Analytics
Planning
Assessments
Roadmaps
Data Governance
Strategy & Architecture
Organization Planning
Proof of Value
Analytics
Data Visualization
Big Data Analytics
Machine Learning
BI Reporting Dashboards
Advanced Analytics & Data Science
CRM / Salesforce Analytics
Data
Big Data Architecture
Lean Analytics
Enterprise Data Warehousing
Master Data Management
System Optimization
Outsourcing
Software Development
Managed Services
On-Shore / Off-Shore
Cloud Analytics
Recruiting & Staffing
Click to Start Whatsapp Chatbot with Sales
Mobile: +447745139693 Email: [email protected]
0 notes
candientuquocthinh · 3 years
Photo
Tumblr media
The Marcus TDWI 1 weighing indicator delivers a high-quality, affordable measurement solution that operates reliably in industrial environments.
See more: https://candientuquocthinh.com/dau-can-dien-tu/dau-can-tdwi-1/
#candientu, #cânđiệntử, #candientuquocthinh, #đầu_cân_điện_tử
0 notes