#BigQueryanalytics
Earth Engine in BigQuery: A New Geospatial SQL Analytics Capability

BigQuery Earth Engine
With Earth Engine directly integrated into BigQuery, Google Cloud has expanded its geospatial analytics capabilities. Announced at Google Cloud Next '25, the integration brings powerful raster analytics into BigQuery, letting SQL users analyze geospatial data derived from satellite imagery.
Google Cloud customers already rely on BigQuery to store and analyze vector data, which represents features such as buildings and boundaries as points, lines, or polygons. Earth Engine, by contrast, is the recommended home for storing and processing raster data such as satellite imagery, which encodes geographic information as a grid of pixels carrying values like temperature, elevation, and land cover.
Earth Engine in BigQuery brings vector and raster analytics together. The integration makes advanced raster analysis far more accessible and helps solve real-world business problems.
Key features driving this integration:
The headline addition is ST_RegionStats(), a new BigQuery geography function. Like Earth Engine's reduceRegion function, it extracts statistics from raster data within geographic boundaries: given a raster image accessible to Earth Engine and a geographic region (vector data), it computes the mean, min, max, sum, or count of the pixels that intersect that geography.
BigQuery Sharing (formerly Analytics Hub) now offers Earth Engine in BigQuery datasets. This makes data easier to discover and opens up more datasets, many of them ready for computing statistics over a region of interest, covering topics such as risk prediction, elevation, and emissions.
Raster analytics with this new feature usually has five steps:
Find vector data representing your areas of interest in a BigQuery table.
Find an Earth Engine raster dataset in BigQuery image assets, Cloud Storage GeoTIFFs, or BigQuery Sharing.
Use ST_RegionStats() with the raster ID, vector geometries, and an optional band name to aggregate the intersecting data (see the sketch after these steps).
Inspect the ST_RegionStats() output to interpret the results.
Use BigQuery Geo Viz to map analysis results.
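A minimal sketch of steps 1–3, assuming a hypothetical my_dataset.us_counties table with a county_geom GEOGRAPHY column, the placeholder 'ee://IMAGE_PATH' mentioned later as the raster ID, and a hypothetical band name; the .mean field shown follows the statistics described above, though the exact output fields may differ:

SELECT
  county_name,
  ST_REGIONSTATS(
    county_geom,          -- vector geometry (GEOGRAPHY) for the region of interest
    'ee://IMAGE_PATH',    -- raster ID: Earth Engine image asset placeholder
    'band_name'           -- optional band name (hypothetical)
  ).mean AS mean_value    -- assumed output field; min, max, sum, and count are also described above
FROM my_dataset.us_counties;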
This integration enables data-driven decision-making across sustainability and other geospatial use cases:
Climate, physical risk, and disaster response: Using drought, wildfire, and flood data in transportation, infrastructure, and urban design. For instance, using the Wildfire hazard to Communities dataset to assess wildfire risk or the Global River Flood Hazard dataset to estimate flood risk.
Land use and land cover: Assessing land use, elevation, and cover for agricultural evaluations and supply chain management. This includes using JRC Global Forest Cover datasets or Forest Data Partnership maps to determine whether commodities are grown in non-deforested areas.
Methane emissions monitoring: MethaneSAT L4 Area Sources data can identify methane emission hotspots from minor, distributed sources in oil and gas basins to enhance mitigation efforts.
Custom use cases: Importing your own Earth Engine raster datasets as BigQuery image assets or Cloud Storage GeoTIFFs.
Raster data sources for ST_RegionStats() are available through BigQuery Sharing, where the assets.image.href column of each image table normally holds the raster ID. Cloud Storage GeoTIFFs in the US or us-central1 regions can be referenced by URI, and Earth Engine image asset locations such as 'ee://IMAGE_PATH' are also supported in BigQuery.
ST_RegionStats()'s include option lets users adjust computations by assigning pixel weights between 0 and 1, with 0 representing missing data. If no weight is given, pixels are weighted by how much of each pixel the geography covers. Raster pixel size, or scale, also affects the calculation and its output: changing the scale (e.g., with options => JSON '{"scale": 1000}') can reduce query runtime and cost for prototyping, but it changes the results and should not be used for production analysis.
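As a hedged illustration of the options parameter, the same call can pass a coarser scale for prototyping; the table, geometry, and band names remain hypothetical, and the scale value is assumed to be in meters:

SELECT
  ST_REGIONSTATS(
    county_geom,
    'ee://IMAGE_PATH',
    'band_name',
    options => JSON '{"scale": 1000}'  -- coarser pixel size for cheaper, faster prototype runs
  ).mean AS approx_mean
FROM my_dataset.us_counties;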
Because the computation is performed by Earth Engine, ST_RegionStats() is billed separately under BigQuery Services. Costs depend on the number of input rows, the raster image's resolution, the size and complexity of the input geographies, the number of intersecting pixels, the image projection, and the formulas used. Earth Engine quotas on BigQuery slot-time usage can be adjusted to keep expenses under control.
Currently, ST_RegionStats() queries must be run in the US, us-central1, or us-central2.
This is a significant step forward for Google Cloud's geospatial analytics, bringing advanced raster capabilities to SQL users and supporting data-driven decision-making in sustainability and beyond.
#EarthEngineinBigQuery #BigQuery #EarthEngine #geospatialanalytics #SQL #BigQueryanalytics #technology #TechNews #technologynews #news #govindhtech
Boost AI Production With Data Agents And BigQuery Platform

Limited data accessibility can hinder AI adoption, since so much data is unstructured and unmanaged. Data should be accessible, actionable, and transformative for businesses. Available today to help you get there: a data cloud built on open standards that connects data to AI in real time, and conversational data agents that stretch the limits of conventional AI.
An open real-time data ecosystem
Earlier this year, Google Cloud announced its intention to make BigQuery a single platform for data and AI use cases, spanning all data formats, multiple engines, governance, ML, and business intelligence. It also announced a managed Apache Iceberg experience for open-format customers, and added document, audio, image, and video processing to simplify multimodal data preparation.
Volkswagen bases AI models on car owner’s manuals, customer FAQs, help center articles, and official Volkswagen YouTube videos using BigQuery.
New managed services for Flink and Kafka let customers ingest, set up, tune, scale, monitor, and upgrade real-time applications. Data engineers can build and run data pipelines manually, via API, or on a schedule using BigQuery workflows, now in preview.
Customers can now also activate insights in real time with BigQuery continuous queries, another major addition. In the past, "real-time" meant examining data that was minutes or hours old, but data ingestion and analysis are changing fast: for customer engagement, decision-making, and AI-driven automation, the acceptable latency has dropped substantially, and the path from insight to activation needs to be seamless and take seconds, not minutes or hours. Real-time data sharing has also been added to the Analytics Hub data marketplace in preview.
Google Cloud is also launching BigQuery pipe syntax to help customers manage, analyze, and extract value from log data. Data teams can simplify data transformations with SQL designed for semi-structured log data, as in the sketch below.
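A minimal sketch of pipe syntax over a hypothetical log table; the table name, columns, and JSON payload structure are assumptions rather than part of the announcement:

-- Count errors per service from semi-structured logs using pipe syntax.
FROM my_project.logs.app_events
|> WHERE severity = 'ERROR'
|> EXTEND JSON_VALUE(payload, '$.service') AS service  -- pull a field out of the JSON payload
|> AGGREGATE COUNT(*) AS error_count GROUP BY service
|> ORDER BY error_count DESC;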
Connect all data to AI
BigQuery customers can generate and search embeddings at scale for semantic nearest-neighbor search, entity resolution, semantic search, similarity detection, RAG, and recommendations. Vertex AI integration makes it easy to work with text, images, video, multimodal data, and structured data, and the BigQuery integration with LangChain, now generally available, simplifies data pre-processing, embedding creation and storage, and vector search.
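A hedged sketch of that embedding-and-search flow, assuming a remote embedding model (my_dataset.embedding_model) and a document_embeddings table with content and embedding columns already exist; none of these names come from the announcement:

-- Embed a search phrase, then return the five nearest documents by vector distance.
SELECT base.content AS matched_document, distance
FROM VECTOR_SEARCH(
  TABLE my_dataset.document_embeddings,  -- hypothetical table with an ARRAY<FLOAT64> embedding column
  'embedding',
  (
    SELECT ml_generate_embedding_result AS embedding
    FROM ML.GENERATE_EMBEDDING(
      MODEL my_dataset.embedding_model,  -- hypothetical remote Vertex AI embedding model
      (SELECT 'sustainable packaging options' AS content)
    )
  ),
  top_k => 5
);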
To improve vector search, ScaNN-based search for large queries is now in preview. This is the same technology behind Google Search and YouTube: the ScaNN index supports over one billion vectors and delivers strong query performance, enabling high-scale workloads for any enterprise.
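To speed up searches over that hypothetical embeddings table, a ScaNN-backed vector index can be created with the TREE_AH index type; this is a sketch of one possible setup, not a required step:

-- Build a ScaNN (TREE_AH) vector index on the assumed embedding column.
CREATE VECTOR INDEX document_embeddings_idx
ON my_dataset.document_embeddings(embedding)
OPTIONS (index_type = 'TREE_AH', distance_type = 'COSINE');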
Python-based data processing is also getting simpler with BigQuery DataFrames. Synthetic data can stand in for real data in ML model training and system testing, and a partnership with Gretel AI brings synthetic data generation to BigQuery to speed up AI experiments. This data closely resembles your actual data but contains no sensitive information.
Finer governance and data integration
Tens of thousands of companies fuel their data clouds with BigQuery and AI. However, in the data-driven AI era, enterprises must manage more kinds of data and more kinds of workloads.
BigQuery’s serverless design helps Box process hundreds of thousands of events per second and manage petabyte-scale storage for billions of files and millions of users. Finer access control in BigQuery helps them locate, classify, and secure sensitive data fields.
Data management and governance become even more important as data access and AI use cases grow. Google Cloud is unveiling BigQuery's unified catalog, which automatically harvests, ingests, and indexes metadata from data sources, AI models, and BI assets to help you discover your data and AI assets. BigQuery catalog semantic search, in preview, lets you find and query all of those assets regardless of type or location: users can ask natural-language questions, and BigQuery understands their intent to return the most relevant results, making it easier to locate what they need.
It enables more third-party data sources for your use cases and workflows. Equifax recently expanded its cooperation with Google Cloud to securely offer anonymized, differentiated loan, credit, and commercial marketing data using BigQuery.
Equifax believes more data leads to smarter decisions. By offering its differentiated data on Google Cloud, it meets clients on their preferred channel and enables them to make predictive, informed decisions faster and with more agility.
The new BigQuery metastore makes data available to multiple execution engines. Next month, in preview, multiple engines will be able to run on a single copy of data across structured and unstructured object tables, offering a unified view for policy, performance, and workload orchestration.
Looker lets you apply BigQuery's new governance capabilities to BI. Catalog metadata from Looker instances captures Looker dashboards, Explores, and dimensions without you having to set up, maintain, or operate your own connector.
Finally, BigQuery now offers disaster recovery for business continuity, providing failover and redundant compute resources with an SLA for business-critical workloads. Beyond your data, it enables failover of BigQuery analytics workloads as well.
Gemini conversational data agents
Global organizations want LLM-powered data agents to handle internal and customer-facing tasks, broaden data access, deliver unique insights, and drive action. Google Cloud is developing new conversational APIs that let developers create data agents for self-service data access and monetize their data to differentiate their offerings.
Conversational analytics
These APIs power Looker's Gemini-based conversational analytics experience. Combined with the business logic modeled in Looker's enterprise-scale semantic layer, they let you ground AI in a single source of truth and uniform metrics across the enterprise, and then explore your data in natural language, much like using Google Search.
LookML semantic data models let you define governed metrics and semantic relationships between data models for your data agents. LookML models don't just describe your data; you can query them to retrieve it.
Data agents run on a dynamic data knowledge graph. BigQuery powers the dynamic knowledge graph, which connects data, actions, and relationships using usage patterns, metadata, historical trends, and more.
Last but not least, Gemini in BigQuery is now generally available, assisting data teams with data migration, preparation, code assist, and insights. Your business and analyst teams can now talk to your data and get insights in seconds, fostering a data-driven culture. Ready-to-run queries and AI-assisted data preparation in BigQuery Studio enable natural-language pipeline building and reduce guesswork.
Connect all your data to AI by migrating it to BigQuery with the data migration application. This product roadmap webcast covers BigQuery platform updates.
Read more on Govindhtech.com
#DataAgents #BigQuery #BigQuerypipesyntax #vectorsearch #BigQueryDataFrames #BigQueryanalytics #LookMLmodels #news #technews #technology #technologynews #technologytrends #govindhtech
BigQuery Omni Cuts Multi-cloud Log Ingestion, Analysis Costs

What is BigQuery Omni?
BigQuery Omni is a multi-cloud data analytics solution that lets you use BigLake tables to perform BigQuery analytics on data kept in Azure Blob Storage or Amazon Simple Storage Service (Amazon S3). It offers a single interface for analyzing data from several public clouds, removing the need to relocate data and allowing you to learn from your data no matter where it is stored.
Many businesses store their data across several public clouds, and that data frequently ends up siloed, making it difficult to gain insights from all of it. Evaluating the data calls for a multi-cloud data tool that is fast, affordable, and doesn't add to the cost of decentralized data governance. With a single interface, BigQuery Omni helps lower these frictions.
To perform BigQuery analytics on your external data, you must first connect to Amazon S3 or Blob Storage, and then create a BigLake table that references the data in Blob Storage or Amazon S3 in order to query it; a minimal example follows.
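A minimal sketch of that setup, assuming a BigQuery Omni connection to AWS has already been created; the connection, dataset, bucket, and file paths are hypothetical:

-- BigLake table over Parquet files in Amazon S3, queried through BigQuery Omni.
CREATE EXTERNAL TABLE aws_dataset.s3_events
WITH CONNECTION `aws-us-east-1.my_s3_connection`
OPTIONS (
  format = 'PARQUET',
  uris = ['s3://my-bucket/events/*.parquet']
);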
Additionally, data can be moved between clouds using cross-cloud transfer, or queried across clouds using cross-cloud joins. BigQuery Omni therefore gives you both options for cross-cloud analytics: the freedom to replicate data as needed and the ability to analyze data where it resides.
Google BigQuery Omni
Operating hundreds of separate applications across multiple platforms is not unusual in today’s data-centric enterprises. The enormous amount of logs generated by these applications poses a serious problem for log analytics. Furthermore, accuracy and retrieval are made more difficult by the widespread use of multi-cloud solutions, since the dispersed nature of the logs may make it more difficult to derive valuable insights.
In contrast to a traditional strategy, BigQuery Omni was created to help solve this problem and lower overall expenditures. We’ll go over the specifics in this blog post.
Log analysis involves a number of steps:
Gathering log data: Collect log data from the enterprise's applications and/or infrastructure. A popular method is to save this data in an object storage service such as Google Cloud Storage in JSONL file format. Moving raw log data between clouds in a multi-cloud setup can be prohibitively expensive.
Normalization of log data: Different infrastructures and applications produce distinct JSONL files, and the fields in each file are specific to the program or infrastructure that produced it. To make analysis easier, these disparate fields are mapped into a single common set, which lets data analysts run thorough and efficient analyses across the environment.
Indexing and storage: To lower storage and query expenses and improve query performance, normalized data should be stored effectively. Logs are often stored in a compressed columnar file format, such as Parquet.
Querying and visualization: Enterprises run analytics queries over the log data to find known threats, anomalies, or anti-patterns, and visualize the results.
Data lifecycle: The usefulness of log data declines with age while its storage costs persist, so a data lifecycle procedure must be established to optimize costs. Archiving logs after a month (it is rare to query log data older than a month) and deleting them after a year are common practices. This strategy keeps important data available while efficiently controlling storage expenses.
A common architecture
Many businesses use the following architecture to apply log analysis in a multi-cloud setting (image credit: Google Cloud):
This architecture has advantages and disadvantages.
On the plus side:
Data lifecycle: Data lifecycle management can be implemented very easily using the built-in functionality of object storage solutions. For instance, you can set the following lifecycle policies in Cloud Storage: (a) delete any object older than a week, which removes the JSONL files produced during the collection step; (b) archive any object older than a month, which moves your Parquet files to colder storage; and (c) delete any object older than a year, which removes your Parquet files.
Minimal egress costs: By storing the data locally, you can avoid transmitting large amounts of unprocessed information between cloud providers.
On the minus side:
Normalization of log data: You have to write and maintain an Apache Spark workload for every application whose logs you gather. At a time when (a) engineers are in short supply and (b) the use of microservices is expanding quickly, this is worth avoiding.
Querying: You can’t do as much analysis and visualization if you spread your data over several cloud providers.
Querying: Excluding archived files created earlier in the data lifecycle is not simple; relying on WHERE clauses to skip partitions that contain archived files is error-prone. One alternative is to work with an Iceberg table and manage its manifest, adding and removing partitions as needed, but editing the Iceberg table manifest by hand is difficult, and adopting a third-party solution only adds cost.
A better way to address all of these issues would be to use BigQuery Omni, which is shown in the architecture below.
This method’s primary advantage is the removal of several Spark workloads and the need for software engineers to code and maintain them. Having a single product (BigQuery) manage the entire process, aside from storage and visualization, is another advantage of this system. You gain from cost savings as well. Below, we’ll go into more detail about each of these points.
A streamlined procedure for normalization
BigQuery's ability to automatically determine the schema of JSONL files and generate an external table pointing to them is a valuable feature, especially when working with multiple log schema formats. The JSONL content of any application can be accessed by defining a simple CREATE EXTERNAL TABLE statement, as sketched below.
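A sketch of such a declaration for JSONL logs landing in Amazon S3 (Cloud Storage or Blob Storage would look the same apart from the connection and URI); omitting the schema lets BigQuery detect it automatically, and all names here are hypothetical:

-- External table over raw JSONL log files; BigQuery infers the schema.
CREATE EXTERNAL TABLE logs_dataset.app_logs_jsonl
WITH CONNECTION `aws-us-east-1.my_s3_connection`
OPTIONS (
  format = 'JSON',  -- newline-delimited JSON
  uris = ['s3://my-log-bucket/app-logs/*.jsonl']
);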
From there, you can schedule BigQuery to export the JSONL external table into compressed, Hive-partitioned Parquet files split into hourly segments. The query below shows an EXPORT DATA statement that can be scheduled to run once an hour; its SELECT statement captures only the log data ingested during the previous hour and transforms it into a Parquet file with normalized columns.
DECLARE hour_ago_rounded_string STRING;
DECLARE hour_ago_rounded_timestamp DEFAULT DATETIME_TRUNC(TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR), HOUR);

SET (hour_ago_rounded_string) = (
  SELECT AS STRUCT FORMAT_TIMESTAMP("%Y-%m-%dT%H:00:00Z", hour_ago_rounded_timestamp, "UTC")
);

EXPORT DATA OPTIONS (
  uri = CONCAT('[MY_BUCKET_FOR_PARQUET_FILES]/ingested_date=', hour_ago_rounded_string, '/logs-*.parquet'),
  format = 'PARQUET',
  compression = 'GZIP',
  overwrite = true
) AS (
  SELECT [MY_NORMALIZED_FIELDS] EXCEPT(ingested_date)
  FROM [MY_JSONL_EXTERNAL_TABLE] AS jsonl_table
  WHERE DATETIME_TRUNC(jsonl_table.timestamp, HOUR) = hour_ago_rounded_timestamp
);
A uniform querying procedure for all cloud service providers
While using the same data warehouse platform across several cloud providers already improves querying, BigQuery Omni's ability to perform cross-cloud joins is a game changer for log analytics. Before cross-cloud joins, combining log data from multiple cloud providers was difficult: sending the raw data to a single primary cloud provider incurs large egress costs because of the data volume, while pre-processing and filtering it first limits the analytics you can run on it. Cross-cloud joins let you execute a single query across several clouds and examine the combined results, as in the sketch below.
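A hedged sketch of such a query: one normalized-log table lives in a BigQuery Omni dataset on AWS, the other in a US dataset on Google Cloud, and only the joined results leave each cloud. All table and column names are hypothetical:

-- Correlate error logs across clouds by trace ID with a single cross-cloud join.
SELECT
  gcp.service,
  COUNT(*) AS matched_errors
FROM gcp_logs_dataset.normalized_logs AS gcp
JOIN aws_logs_dataset.normalized_logs AS aws  -- BigLake table in an Omni (AWS) region
  ON gcp.trace_id = aws.trace_id
WHERE gcp.severity = 'ERROR'
GROUP BY gcp.service;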
Reduces TCO
This architecture’s ability to lower total cost of ownership (TCO) is its last and most significant advantage. There are 3 ways to measure this:
Decreased engineering resources: Apache Spark is eliminated from the process, which helps in two ways. First, no software developer is needed to write and maintain Spark code; by using standard SQL queries, the log analytics team can complete the deployment more quickly. Second, because BigQuery and BigQuery Omni are PaaS offerings, their shared-responsibility model extends to data in AWS and Azure.
Lower compute resources: Apache Spark may not always provide the most economical environment. An Apache Spark solution consists of the application itself, the Spark platform, and the virtual machines (VMs) it runs on. BigQuery, in comparison, uses slots (virtual CPUs, not virtual machines), and the export query is transformed into C-compiled code during the export process, which can lead to faster performance for this particular operation.
Lower egress expenses: BigQuery Omni eliminates the need to transfer raw data between cloud providers in order to get a consolidated view of the data by processing data in-situ and egressing only results through cross-cloud joins.
What is the best way to use BigQuery in this setting?
BigQuery has two compute pricing models for query execution:
On-demand pricing (per TiB): This model charges you according to the number of bytes each query processes, with the first 1 TiB of query data processed each month free of charge. This model is not advised here, because log analytics workloads scan a lot of data.
Capacity pricing (per slot-hour): Under this pricing model, you are billed for the amount of computing power that is utilized to execute queries over time, expressed in slots (virtual CPUs). This model utilizes editions of BigQuery. Slot commitments, which are dedicated capacity always available for your workloads, are less expensive than on-demand, and you can use the BigQuery autoscaler.
In order to conduct an empirical test, Google assigned 100 slots (baseline 0, maximum slots 100) to a project that aimed to export log JSONL data into a compressed Parquet format. This configuration allowed BigQuery to process 1PB of data daily without using up all 100 slots.
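A sketch of how such a slot configuration might be expressed with reservation DDL (baseline of 0 slots, autoscaling up to 100); the project name, region, reservation name, and edition are assumptions, and the same setup can also be done in the console:

-- Enterprise-edition reservation that autoscales from a 0-slot baseline up to 100 slots.
CREATE RESERVATION `my-admin-project.region-us.log-export`
OPTIONS (
  edition = 'ENTERPRISE',
  slot_capacity = 0,           -- baseline slots
  autoscale_max_slots = 100    -- autoscaling ceiling
);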
To enable this TCO reduction for log analytics workloads in a multi-cloud context, this blog post proposed an architecture that substitutes SQL queries running on BigQuery Omni for Apache Spark applications. Depending on your data environment, this approach can minimize overall DevOps complexity while lowering engineering, compute, and egress costs.
BigQuery Omni pricing
Please refer to BigQuery Omni pricing for details on price and time-limited promotions.
Read more on Govindhtech.com
#BigQuery #BigQueryOmni #Multicloud #BigQueryOmniAzure #AzureBlobStorage #AmazonS3 #BigQueryanalytics #IcebergTable #News #Technews #Technology #Technologynews #Technologytrends #govindhtech