#SQL datetime analytics
Explore tagged Tumblr posts
thedbahub · 1 year ago
Text
Grouping Data by Time Intervals in SQL Server: Hourly and 10-Minute Aggregations
In SQL Server, grouping data by time intervals such as by hour or by 10 minutes requires manipulation of the date and time values so that rows falling within each interval are grouped together. This can be achieved using the DATEPART function for hourly grouping or a combination of DATEPART and arithmetic operations for more granular groupings like every 10 minutes. Here’s how you can do…
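The post is truncated here, but a minimal sketch of the technique it describes might look like the following. The dbo.Events table and EventTime column are illustrative, not from the original post.
-- Hourly totals: group on the date plus the hour extracted with DATEPART.
SELECT
    CAST(EventTime AS date)        AS EventDate,
    DATEPART(HOUR, EventTime)      AS EventHour,
    COUNT(*)                       AS RowsInInterval
FROM dbo.Events
GROUP BY CAST(EventTime AS date), DATEPART(HOUR, EventTime)
ORDER BY EventDate, EventHour;

-- 10-minute buckets: combine DATEPART with integer arithmetic so that
-- minutes 0-9, 10-19, and so on land in the same group.
SELECT
    CAST(EventTime AS date)                  AS EventDate,
    DATEPART(HOUR, EventTime)                AS EventHour,
    (DATEPART(MINUTE, EventTime) / 10) * 10  AS TenMinuteStart,
    COUNT(*)                                 AS RowsInInterval
FROM dbo.Events
GROUP BY CAST(EventTime AS date),
         DATEPART(HOUR, EventTime),
         (DATEPART(MINUTE, EventTime) / 10) * 10;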
View On WordPress
0 notes
govindhtech · 1 year ago
Text
How BigQuery Data Canvas Makes AI-Powered Insights Easy
A Gemini feature in BigQuery, the BigQuery Studio data canvas provides a graphical interface for analysis processes and natural language prompts for finding, transforming, querying, and visualising data.
A directed acyclic graph (DAG) is used by BigQuery data canvas for analysis workflows, giving you a graphical representation of your workflow. Working with many branches of inquiry in one location and iterating on query results are both possible with BigQuery data canvas.
BigQuery data canvas
The BigQuery data canvas is intended to support you on your path from data to insights. Working with data doesn’t require technical expertise of particular products or technologies. Using natural language, BigQuery data canvas and Dataplex metadata combine to find relevant tables.
Gemini in BigQuery is used by BigQuery data canvas to locate your data, build charts, create SQL, and create data summaries.
Capabilities
BigQuery data canvas lets you do the following:
Use keyword search syntax along with Dataplex metadata to find assets such as tables, views, or materialized views.
Use natural language for basic SQL queries such as the following:
Queries that contain FROM clauses, math functions, arrays, and structs.
JOIN operations for two tables.
Visualize data by using the following graphic types:
Bar chart
Heat map
Line graph
Pie chart
Scatter chart
Create custom visualizations by using natural language to describe what you want.
Automate data insights.
Limitations
Natural language commands might not work well with the following:
BigQuery ML
Apache Spark
Object tables
BigLake
INFORMATION_SCHEMA views
JSON
Nested and repeated fields
Complex functions and data types such as DATETIME and TIMEZONE
Data visualizations don’t work with geomap charts.
BigQuery data canvas, a Gemini in BigQuery feature, is a ground-breaking data analytics tool that streamlines the whole data analysis process, from data preparation and discovery to analysis, visualisation, and collaboration, all in one location within BigQuery. Because it uses natural language processing, you can ask questions about your data in plain English or in a variety of other languages.
Because this easy approach removes the need to hand-write sophisticated SQL queries, data analysis becomes accessible to both technical and non-technical people. You can examine, modify, and display your BigQuery data with data canvas without ever leaving the environment in which it is stored.
This blog post provides an overview of BigQuery data canvas along with a technical walkthrough of a real-world scenario that uses the public github_repos dataset. The dataset contains over 3TB of activity from 3M+ open-source repositories. We'll look at how to answer questions like:
How many commits were made to a particular repository in a given year?
Who authored the most repositories in a particular year?
How many non-authored commits were applied over time?
Which users contributed to a certain file, and when?
You'll see how data canvas handles intricate SQL operations from your natural language prompts, such as joining tables, extracting particular data items, unnesting fields, and converting timestamps. We'll even show you how to create intelligent summaries and visualisations with a single click.
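As a concrete illustration, a prompt like "how many commits did a given repository receive per month last year?" might be turned into SQL along these lines. This is a hand-written sketch rather than captured data canvas output; it assumes the commits table exposes a repeated repo_name field and a committer.date timestamp, so verify both against the actual github_repos schema and swap in your own repository name and year.
SELECT
  FORMAT_TIMESTAMP('%Y-%m', committer.date) AS commit_month,
  COUNT(*) AS commit_count
FROM `bigquery-public-data.github_repos.commits`,
  UNNEST(repo_name) AS repo
WHERE repo = 'GoogleCloudPlatform/python-docs-samples'  -- illustrative repository
  AND EXTRACT(YEAR FROM committer.date) = 2023           -- illustrative year
GROUP BY commit_month
ORDER BY commit_month;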
A quick overview of BigQuery data canvas
BigQuery data canvas is mostly used for three types of tasks: finding data, generating SQL, and generating insights. Image credit: Google Cloud
Find Data
To locate data in BigQuery using a rapid keyword search or a natural language text prompt, use data canvas.
Generate SQL
Additionally, you may use the BigQuery data canvas to have SQL code written for you using natural language prompts powered by Gemini.
Create Insights
Finally, use a single click to uncover insights concealed within your data! Gemini creates visualisations for you automatically so you can see the story your data is telling.
Using the BigQuery data canvas
Let’s look at an example to help you better understand the potential impact that the BigQuery data canvas can have in your company. Businesses of all kinds, from big corporations to tiny startups, can gain from having a better grasp of the productivity of their development staff. Google Cloud will demonstrate in this in-depth technical tutorial how to leverage data canvas and the public dataset github_repos to provide insightful results in a shared workspace.
You’ll learn how data canvas simplifies the creation of sophisticated SQL queries by working through this example, which demonstrates how to create joins and unnested columns, convert timestamps, extract the month and year from date fields, and more. Gemini’s features make it simple to create these queries and use natural language to examine your data with illuminating visualisations.
Please be aware that, as with many of the new AI products and services available today, using any LLM-enabled application successfully requires strong prompt-engineering skills. Many people believe that large language models (LLMs) aren't very good at producing SQL out of the box. In our experience, however, Gemini in BigQuery via data canvas can produce sophisticated SQL queries from the context of your data corpus if you use the appropriate prompting techniques. Data canvas uses your natural language queries to decide the ordering, grouping, sorting, record-count limits, and overall SQL structure.
The github_repos dataset, which is 3TB+ in size and available in BigQuery Public Datasets, comprises information in numerous tables about commits, watch counts, and other activity on 3M+ open-source projects. We want to look at the Google Cloud Platform repository for this example. As always, before you begin, make sure you have the necessary IAM permissions. In addition, make sure you have the necessary rights to access the datasets and data canvas in order to run nodes properly.
Using data canvas makes it simple to explore every table in the github_repos dataset. Here, you can evaluate the schema and details and preview the data in one panel while comparing datasets side by side. Image credit: Google Cloud
After choosing your dataset, you can hover over the bottom of the node to branch it to query or join it with another table. The dataset for the following transformation node is shown by arrows. For clarity, you can give each node a name when sharing the canvas. You can delete, debug, duplicate, or run all of the nodes in a series using the options in the upper right corner. Results can be downloaded, and data can be exported to Looker Studio or Sheets. In the navigation panel, you can also inspect the DAG structure, restore previous versions, and rate SQL suggestions.
While examining the github_repos dataset, Google Cloud looks at four main facets of the data, attempting to ascertain the following (a hand-written sketch of the kind of SQL the third question implies appears after this list):
1) The total number of commits made in a single year
2) The number of authored repositories for a specific year
3) The total number of non-authored commits applied over time
4) How many user commits there have been for a specific file at a specific time
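For example, the third question, non-authored commits over time, could reduce to SQL roughly like this. Field names such as author.email, committer.email, committer.date, and the repeated repo_name are assumptions to check against the public schema; data canvas would generate its own variant from the prompt.
SELECT
  EXTRACT(YEAR FROM committer.date) AS commit_year,
  COUNT(*) AS non_authored_commits
FROM `bigquery-public-data.github_repos.commits`,
  UNNEST(repo_name) AS repo
WHERE repo LIKE 'GoogleCloudPlatform/%'
  AND author.email != committer.email
GROUP BY commit_year
ORDER BY commit_year;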
Utilise BigQuery data canvas to simplify data analysis
It might be challenging to interpret data for a new project or use case when working with large datasets that span multiple disciplines. This procedure can be streamlined by using data canvas. Data canvas helps you work more efficiently and quickly by streamlining data analysis using natural language-based SQL creation and visualisations. It also reduces the need for repetitive queries and lets you plan automatic data refreshes.
Read more on Govindhtech.com
0 notes
suiteviz-blog · 6 years ago
Text
Which SQL Functions Are Available in NetSuite Saved Searches?
Numeric Functions
Examples
ABS( {amount} )
ACOS( 0.35 )
ASIN( 1 )
ATAN( 0.2 )
ATAN2( 0.2, 0.3 )
BITAND( 5, 3 )
CEIL( {today}-{createddate} )
COS( 0.35 )
COSH( -3.15 )
EXP( {rate} )
FLOOR( {today}-{createddate} )
LN( 20 )
LOG( 10, 20 )
MOD( {today}-{lastmessagedate},7 )
NANVL( {itemisbn13}, '' )
POWER( {custcoldaystoship},-.196 )
REMAINDER( {transaction.totalamount}, {transaction.amountpaid} )
ROUND( ( {today}-{startdate} ), 0 )
SIGN( {quantity} )
SIN( 5.2 )
SINH( 3 )
SQRT( POWER( {taxamount}, 2 ) )
TAN( -5.2 )
TANH( 3 )
TRUNC( {amount}, 1 )
Character Functions Returning Character Values
Examples
CHR( 13 )
CONCAT( {number}, CONCAT( '_', {line} ) )
INITCAP( {customer.companyname} )
LOWER( {customer.companyname} )
LPAD( {line},3,'0' )
LTRIM( {companyname},'-' )
REGEXP_REPLACE( {name}, '^.*:', '' )
REGEXP_SUBSTR( {item}, '[^:]+$' )
REPLACE( {serialnumber}, '&', ',' )
RPAD( {firstname},20 )
RTRIM( {paidtransaction.externalid}, '-Invoice' )
SOUNDEX( {companyname} )
SUBSTR( {transaction.salesrep}, 1, 3 )
TRANSLATE( {expensecategory}, ' ', '+' )
TRIM ( BOTH ',' FROM {custrecord_assetcost} )
UPPER( {unit} )
Character Functions Returning Number Values
Examples
ASCII( {taxitem} )
INSTR( {messages.message}, 'cspdr3' )
LENGTH( {name} )
REGEXP_INSTR ( {item.unitstype}, '\d' )
TO_NUMBER( {quantity} )
Datetime Functions
Examples
ADD_MONTHS( {today},-1 )
LAST_DAY( {today} )
MONTHS_BETWEEN( SYSDATE, {createddate} )
NEXT_DAY( {today},'SATURDAY' )
ROUND( TO_DATE( '12/31/2014', 'mm/dd/yyyy' )-{datecreated} )
TO_CHAR( {date}, 'hh24' )
TO_DATE( '31.12.2011', 'DD.MM.YYYY' )
TRUNC( {today},'YYYY' )
Also see Sysdate in one of the example sections below.
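These datetime functions also combine into single saved-search formulas. For instance (the {trandate} field is just an example; substitute your own date field):
TO_CHAR( {trandate}, 'YYYY-MM' )
TRUNC( {trandate}, 'MM' )
ADD_MONTHS( TRUNC( {today}, 'MM' ), -1 )
The first, used as a grouped formula (text) column, buckets results by calendar month; the second returns the first day of the transaction's month; the third returns the first day of the previous month, which is handy in date comparisons.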
NULL-Related Functions
Examples
COALESCE( {quantitycommitted}, 0 )
NULLIF( {price}, 0 )
NVL( {quantity},'0' )
NVL2( {location}, 1, 2 )
Decode
Examples
DECODE( {systemnotes.name}, {assigned},'T','F' )
Sysdate
Examples
TO_DATE( SYSDATE, 'DD.MM.YYYY' )
or
TO_CHAR( SYSDATE, 'mm/dd/yyyy' )
See also TO_DATE and TO_CHAR in the Datetime Functions.
Case
Examples
CASE {state}
WHEN 'NY' THEN 'New York'
WHEN 'CA' THEN 'California'
ELSE {state}
END
or
CASE
WHEN {quantityavailable} > 19 THEN 'In Stock'
WHEN {quantityavailable} > 1 THEN 'Limited Availability'
WHEN {quantityavailable} = 1 THEN 'The Last Piece'
WHEN {quantityavailable} IS NULL THEN 'Discontinued'
ELSE 'Out of Stock'
END
Analytic and Aggregate Functions
Examples
DENSE_RANK ( {amount} WITHIN GROUP ( ORDER BY {AMOUNT} ) )
or
DENSE_RANK(  ) OVER ( PARTITION BY {name} ORDER BY {trandate} DESC )
KEEP( DENSE_RANK LAST ORDER BY {internalid} )
RANK(  ) OVER ( PARTITION by {tranid} ORDER BY {line} DESC )
or
RANK ( {amount} WITHIN GROUP ( ORDER BY {amount} ) )
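The analytic functions can also be combined with the datetime functions above, for example to rank transactions by amount within each calendar month. This is a sketch: substitute your own fields, and note that whether an expression is accepted inside PARTITION BY can depend on the search context.
RANK(  ) OVER ( PARTITION BY TO_CHAR( {trandate}, 'YYYY-MM' ) ORDER BY {amount} DESC )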
1 note · View note
awesomehenny888 · 4 years ago
Text
5 Best SQL Certifications to Boost Your Career in 2021
If you want to work in data roles such as data scientist, database administrator, or big data architect, Structured Query Language (SQL) is one of the programming languages you must master. However, if you want to be recruited quickly by a large company, an SQL certification is a must-have. There are many SQL certifications you can earn, so which ones? Check out the rundown below.
The 5 Best SQL Certifications
1. The Ultimate MySQL Bootcamp: Udemy
This Udemy course provides a great deal of practice to improve your skills, starting with the basics of MySQL and continuing on to several other concepts. The course includes plenty of exercises, and you can take it at whatever pace you like.
Course curriculum
SQL overview and installation: SQL vs. MySQL, installation on Windows and Mac
Creating databases and tables: creating and dropping tables, basic data types
Inserting data, NULL, NOT NULL, primary keys, table constraints
CRUD commands: SELECT, UPDATE, DELETE, challenge exercises
String functions: concat, substring, replace, reverse, char length, upper and lower
Using different wildcard characters, order by, limit, like, wildcards
Aggregate functions: count, group by, min, max, sum, avg (a sample query of this kind appears right after this list)
Data types in detail: char, varchar, decimal, float, double, date, time, datetime, now, curdate, curtime, timestamp
Logical operators: not equal, not like, greater than, less than, AND, OR, between, not in, in, case statements
One to many: joins, foreign keys, cross join, inner join, left join, right join; many to many
Instagram data clone: Instagram clone schema, users schema, likes, comments, photos, hashtags, complete schema
Working with big data: JUMBO dataset, exercises
Introducing Node: crash course on Node.js, npm, MySQL, and other languages
Building a web application: setting up, connecting Express and MySQL, adding EJS templates, connecting the form
Database triggers: writing triggers, preventing Instagram self-follows with triggers, creating logger triggers, managing triggers, and a warning
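As a taste of what the aggregate-function and date/time lessons above build towards, a typical exercise query might look like the one below. The orders table and its columns are invented for illustration, not taken from the course.
-- Orders and revenue per month for the past year (illustrative schema).
SELECT
    DATE_FORMAT(order_date, '%Y-%m') AS order_month,
    COUNT(*)                         AS order_count,
    SUM(total)                       AS revenue
FROM orders
WHERE order_date >= CURDATE() - INTERVAL 1 YEAR
GROUP BY order_month
ORDER BY order_month;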
2. Learn SQL Basics for Data Science Specialization
This training aims to apply, in a practical way, all the SQL concepts used in data science. The first course of the specialization is a foundation course that teaches you all the SQL knowledge you will later need for the other courses. The specialization consists of four courses:
SQL for Data Science.
Data Wrangling, Analysis, and AB Testing with SQL.
Distributed Computing with Spark SQL.
SQL for Data Science Capstone Project.
Course curriculum
1. SQL for Data Science (14 hours)
Introduction, selecting, and fetching data using SQL.
Filtering, Sorting, and Calculating Data with SQL.
Subqueries and Joins in SQL.
Modifying and Analyzing Data with SQL.
2. Data Wrangling, Analysis, and AB Testing with SQL
Data of Unknown Quality.
Creating Clean Datasets.
SQL Problem Solving.
Case Study: AB Testing.
3. Distributed Computing with Spark SQL
Introduction to Spark.
Spark Core Concepts.
Engineering Data Pipelines.
Machine Learning Applications of Spark.
4. SQL for Data Science Capstone Project
Project Proposal and Data Selection/Preparation.
Descriptive Stats & Understanding Your Data.
Beyond Descriptive Stats (Dive Deeper/Go Broader).
Presenting Your Findings (Storytelling).
3. Excel to MySQL: Analytic Techniques for Business Specialization
This training is a Coursera specialization that approaches SQL from a business point of view. If you want to go deeper into data science or a related field, this training is excellent. Along with SQL, you will also pick up skills such as Microsoft Excel, business analysis, data science tools and algorithms, and more about business processes. There are five courses in this specialization:
Business Metrics for Data-Driven Companies.
Mastering Data Analysis in Excel.
Data Visualization and Communication with Tableau.
Managing Big Data with MySQL.
Increasing Real Estate Management Profits: Harnessing Data Analytics.
Course curriculum
Business Metrics for Data-Driven Companies (8 hours): introduction to business metrics, the business analytics market, applying business metrics to a business case study.
Mastering Data Analysis in Excel (21 hours): Excel essentials, binary classification, information measures, linear regression, model building.
Data Visualization and Communication with Tableau (25 hours): Tableau, visualization, logic, projects.
Managing Big Data with MySQL (41 hours): relational databases, queries on a single table, grouping data, handling complex data through queries.
Increasing Real Estate Management Profits: Harnessing Data Analytics (23 hours): data extraction and visualization, modeling, cash flow and profit, data dashboards.
4. MySQL for Data Analytics and BI
This training covers MySQL in depth, starting from the basics and moving on to advanced SQL topics. It also includes plenty of exercises to round out your knowledge.
Course curriculum
Introduction to databases, SQL, and MySQL.
SQL theory: SQL as a declarative language, DDL, keywords, DML, DCL, TCL.
Basic terminologies: Relational database, primary key, foreign key, unique key, null values.
Installing MySQL: client-server model, setting up a connection, MySQL interface.
First steps in SQL: SQL files, creating a database, introduction to data types, fixed and floating data types, table creating, using the database, and tables.
MySQL constraints: Primary key constraints, Foreign key constraints, Unique key constraint, NOT NULL
SQL Best practices.
SQL Select, Insert, Update, Delete, Aggregate functions, joins, subqueries, views, Stored routines.
Advanced SQL Topics: Types of MySQL variables, session, and global variables, triggers, user-defined system variables, the CASE statement.
Combining SQL and Tableau.
5. Learning SQL Programming
This training is well suited to beginners and covers all the essential aspects of SQL. It also includes many exercise files that can improve your skills.
Course curriculum
Selecting data from a database.
Understanding JOIN types.
Data types, math, and helpful functions: compound selects, transforming data, using aliases to shorten field names.
Adding or changing data.
Troubleshooting common SQL errors.
Those are the various SQL certifications you can pursue to raise your skills and get hired quickly by a large company. Of course, experience and technical knowledge matter, but an SQL certification becomes the deciding factor when candidates with similar profiles have to be screened. Read also:
3 Benefits of Attending SQL Server Training in Jakarta
0 notes
erossiniuk · 5 years ago
Text
Application Insights: select and filter
Azure Application Insights has its own language and syntax for selecting and filtering data, different from Structured Query Language (SQL).
In this post, I am going to compare Analytics query language to SQL with examples for selection and filtration.
First, navigate to analytics page of any Application Insights App by clicking Logs tab in the overview page of the app.
Navigate to Analytics page
Then, the analytics tab opens a new editor window where you can type your query.
Analytics Logs Query Editor
Now, in the query editor we are going to write our queries using the Analytics Query Language. The easiest way to understand this language is by referring to a well-known language which is SQL.
Select
First of all, to write a wild card query (i.e. query without filtration), all you need to write is the name of the log type you are searching for. For example, “requests”. This is equivalent in SQL to
SELECT * FROM requests
Select: retrieving all requests
Selecting specific fields
The keyword “project” is used to include specific fields in the query output. Copy this query to the query editor to validate your understanding of this rule.
requests | project resultCode, timestamp
This is equivalent in SQL to
SELECT resultCode, timestamp FROM requests
Selecting specific fields
Select number of records
The equivalent to SQL query
SELECT TOP 10 * FROM requests
in Analytics language is
requests | take 10
Selecting the first 10 records
Filters
Filtering with non-null fields
The equivalent to SQL query
SELECT * FROM requests WHERE resultCode IS NOT NULL
in Analytics language is
requests | where isnotnull(resultCode)
Filtering with non-null fields: not Null Filtration
Filtering by comparing with dates
The equivalent to SQL query
SELECT * FROM requests WHERE timestamp > getdate()-1
in Analytics language is
requests | where timestamp > ago(1d)
The equivalent to SQL query
SELECT * FROM requests WHERE timestamp BETWEEN '2020-07-10' AND '2020-07-11'
in Analytics language is
requests | where timestamp > datetime(2020-07-10) and timestamp <= datetime(2020-07-11)
Filtering by comparing with dates
Filtering by comparing with strings
The equivalent to SQL query
SELECT * FROM requests WHERE itemType = 'request'
in Analytics language is
requests | where itemType == "request"
The equivalent to SQL query
SELECT * FROM requests WHERE itemType LIKE 'request%'
in Analytics language is
requests | where itemType startswith "request"
The equivalent to SQL query
SELECT * FROM requests WHERE itemType LIKE '%request%'
in Analytics language is
requests | where itemType contains "request"
Filtering with regular expressions
Analytics language has a keyword for regular expression comparisons as follows
requests | where itemType matches regex "request"
Filtering by comparing with Boolean
The equivalent to SQL query
SELECT * FROM requests WHERE success = 'False'
in Analytics language is
requests | where success == "False"
Filtering by comparing with Boolean
And this is it for Application Insights select and filter. Do you want more details about union? Follow me!
The post Application Insights: select and filter appeared first on PureSourceCode.
from WordPress https://www.puresourcecode.com/tools/application-insights-select-and-filter/
0 notes
siva3155 · 6 years ago
Text
300+ TOP SAS Interview Questions and Answers
SAS Interview Questions for freshers and experienced:
1. What is SAS? What functions does it perform?
SAS stands for Statistical Analysis System, an integrated set of software products. It covers: information retrieval and data management; writing reports and graphics; statistical analytics, econometrics and data mining; business planning, forecasting, and decision support; operations research and project management; quality improvement; data warehousing; and application development.
2. What is the basic structure of a SAS base program?
The basic structure of SAS consists of the DATA step, which retrieves and manipulates data, and the PROC step, which interprets the data.
3. What is the basic syntax style in SAS?
To run a program successfully, you need the following basic elements: a semi-colon at the end of every statement; a DATA statement that defines your data set; an INPUT statement; at least one space between each word or statement; and a RUN statement. For example: infile 'H:\StatHW\yourfilename.dat';
4. Explain the DATA step in SAS.
The DATA step creates a SAS dataset which carries the data along with a "data dictionary." The data dictionary holds information about the variables and their properties.
5. What is the PDV?
The PDV, or Program Data Vector, is a logical area in memory where SAS builds a data set one observation at a time. An input buffer is created at compile time to hold a record from an external file; the PDV is created after the input buffer.
6. Approximately what date is represented by the SAS date value of 730?
31st December 1961.
7. Identify statements whose placement in the DATA step is critical.
INPUT, DATA and RUN.
8. Does SAS 'translate' (compile) or does it 'interpret'?
Compile.
9. What does the RUN statement do?
When the SAS editor sees RUN, it starts compiling the data or proc step. If you have more than one data step or proc step, or a proc step following the data step, you can avoid using the RUN statement.
10. Why is SAS considered self-documenting?
SAS is considered self-documenting because at compilation time it creates and stores information about the data set, such as the time and date of creation, the number of variables, and labels, inside the dataset, and you can view that information using the PROC CONTENTS procedure.
SAS Interview Questions
11. What are some good SAS programming practices for processing very large data sets?
Sort them once; you can use firstobs= and obs=.
12. What is the difference between functions and PROCs that calculate the same simple descriptive statistics?
Functions can be used inside the data step and on the same data set, but with PROCs you can create new data sets to output the results.
13. If you were told to create many records from one record, show how you would do this using arrays and with PROC TRANSPOSE.
I would use TRANSPOSE if the variables are fewer and arrays if the variables are more numerous; it depends.
14. What is a method for assigning first.VAR and last.VAR to the BY group variable on unsorted data?
In unsorted data you can't use First. or Last.
15. How do you debug and test your SAS program?
The first thing is to look in the log for errors, warnings, or NOTEs in some cases, or use the debugger in the SAS data step.
16. What other SAS features do you use for error trapping and data validation?
Check the log, and for data validation use things like PROC FREQ, PROC MEANS, or sometimes PROC PRINT to see how the data looks.
17. How would you combine 3 or more tables with different structures?
I would sort them by common variables and use a MERGE statement; it depends on what is meant by different structures.
18. What areas of SAS are you most interested in?
BASE, STAT, GRAPH, ETS.
19. Briefly describe 5 ways to do a "table lookup" in SAS.
Match merging, direct access, format tables, arrays, PROC SQL.
20. What versions of SAS have you used (on which platforms)?
SAS 9.1.3, 9.0, and 8.2 on Windows and UNIX; SAS 7 and 6.12.
21. What are some good SAS programming practices for processing very large data sets?
Sampling using the OBS option or subsetting, commenting the lines, and using DATA _NULL_.
22. What are some problems you might encounter in processing missing values? In data steps? Arithmetic? Comparisons? Functions? Classifying data?
The result of any operation with a missing value will be a missing value. Most SAS statistical procedures exclude observations with any missing variable values from an analysis.
23. How would you create a data set with 1 observation and 30 variables from a data set with 30 observations and 1 variable?
Using PROC TRANSPOSE.
24. What is the difference between functions and PROCs that calculate the same simple descriptive statistics?
A PROC can be used with a wider scope and the results can be sent to a different dataset. Functions usually affect the existing dataset.
25. If you were told to create many records from one record, show how you would do this using an array and with PROC TRANSPOSE.
Declare an array for the number of variables in the record and then use a DO loop, or use PROC TRANSPOSE with a VAR statement.
26. What are _numeric_ and _character_ and what do they do?
They either read or write all numeric or all character variables in the dataset.
27. How would you create multiple observations from a single observation?
Using the double trailing @@.
28. For what purpose would you use the RETAIN statement?
The RETAIN statement is used to hold the values of variables across iterations of the data step. Normally, all variables in the data step are set to missing at the start of each iteration of the data step.
What is the order of evaluation of the comparison operators: + - * / ** ()?
(), **, *, /, +, -
29. How could you generate test data with no input data?
Using DATA _NULL_ and a PUT statement.
30. How do you debug and test your SAS programs?
Using OBS=0 and system options to trace the program execution in the log.
31. What can you learn from the SAS log when debugging?
It will display the execution of the whole program and the logic. It will also display errors with line numbers so that you can find and edit the program.
32. What is the purpose of _error_?
It has only two values: 1 for error and 0 for no error.
33. How can you put a "trace" in your program?
By using ODS TRACE ON.
34. How does SAS handle missing values in: assignment statements, functions, a merge, an update, sort order, formats, PROCs?
Missing values will be assigned as missing in an assignment statement. Sort order treats missing as the second smallest value, followed by the underscore.
35. How do you test for missing values?
Using subsetting constructs like IF-THEN/ELSE, WHERE and SELECT.
36. How are numeric and character missing values represented internally?
Character as a blank or '' and numeric as a period (.).
37. Which date function advances a date, time or date/time value by a given interval?
INTNX.
38. In the flow of DATA step processing, what is the first action in a typical DATA step?
When you submit a DATA step, SAS processes the DATA step and then creates a new SAS data set (creation of the input buffer and PDV): the compilation phase, then the execution phase.
39. What are SAS/ACCESS and SAS/CONNECT?
SAS/ACCESS works with databases like Oracle, SQL Server, MS Access, etc. SAS/CONNECT handles server connections.
40. What is the one statement to set the criteria of data that can be coded in any step?
OPTIONS statement, LABEL statement, KEEP/DROP statements.
41. What is the purpose of using the N=PS option?
The N=PS option creates a buffer in memory which is large enough to store PAGESIZE (PS) lines and enables a page to be formatted randomly prior to it being printed.
42. What are the scrubbing procedures in SAS?
PROC SORT with the NODUPKEY option, because it will eliminate duplicate values.
43. What are the new features included in the new version of SAS?
The main advantage of version 9 is faster execution of applications and centralized access to data and support. Many changes were made in version 9 compared with version 8. The following are a few: SAS version 9 supports formats longer than 8 bytes, which is not possible in version 8. The length allowed for a numeric format in version 9 is 32, whereas it is 8 in version 8. The length for character formats in version 9 is 31, whereas in version 8 it is 32. The length for a numeric informat in version 9 is 31, versus 8 in version 8. The length for a character informat is 30, versus 32 in version 8. Three new informats are available in version 9 to convert various date, time and datetime forms of data into a SAS date or SAS time: ANYDTDTEw. converts to a SAS date value; ANYDTTMEw. converts to a SAS time value; ANYDTDTMw. converts to a SAS datetime value. The CALL SYMPUTX macro statement was added in version 9; it creates a macro variable at execution time in the data step by trimming trailing blanks and automatically converting numeric values to character. A new ODS option (the COLUMN option) was included to create multiple columns in the output.
44. What differences did you find among versions 6, 8 and 9 of SAS?
The SAS 9 architecture is fundamentally different from any prior version of SAS. In the SAS 9 architecture, SAS relies on a new component, the Metadata Server, to provide an information layer between the programs and the data they access.
Metadata, such as security permissions for SAS libraries and where the various SAS servers are running, is maintained in a common repository.
45. What has been your most common programming mistake?
Missing a semicolon and not checking the log after submitting a program; not using debugging techniques and not using the FSVIEW option vigorously.
Name several ways to achieve efficiency in your program.
Efficiency and performance strategies can be classified into five different areas: CPU time, data storage, elapsed time, input/output, and memory. CPU time and elapsed time are the baseline measurements.
46. A few examples of efficiency violations:
Retaining unwanted datasets; not subsetting early to eliminate unwanted records.
Efficiency-improving techniques: use KEEP and DROP statements to retain only necessary variables; use macros to reduce the code; use IF-THEN/ELSE statements to process data programmatically; use the SQL procedure to reduce the number of programming steps; use LENGTH statements to reduce variable size and so reduce data storage; use DATA _NULL_ steps for processing null data sets to save data storage.
47. What other SAS products have you used and consider yourself proficient in using?
DATA _NULL_ statement, PROC MEANS, PROC REPORT, PROC TABULATE, PROC FREQ, PROC PRINT, PROC UNIVARIATE, etc.
What is the significance of the 'OF' in X=SUM (OF a1-a4, a6, a9);?
If you don't use the OF keyword it might not be interpreted as expected. For example, the function above would calculate the sum of a1 minus a4 plus a6 and a9, and not the whole sum of a1 to a4 plus a6 and a9. The same is true for the MEAN function.
48. How do you use IF-THEN/ELSE logic in PROC SQL?
PROC SQL;
SELECT WEIGHT,
CASE
WHEN WEIGHT BETWEEN 0 AND 50 THEN 'LOW'
WHEN WEIGHT BETWEEN 51 AND 70 THEN 'MEDIUM'
WHEN WEIGHT BETWEEN 71 AND 100 THEN 'HIGH'
ELSE 'VERY HIGH'
END AS NEWWEIGHT
FROM HEALTH;
QUIT;
49. How do you remove duplicates using PROC SQL?
Proc SQL noprint;
Create Table inter.Merged1 as
Select distinct * from inter.readin;
Quit;
50. How do you count unique values by a grouping variable?
You can use PROC SQL with COUNT(DISTINCT variable_name) to determine the number of unique values for a column.
51. What is the one statement to set the criteria of data that can be coded in any step?
Options statement.
52. What is the effect of the OPTIONS statement ERRORS=1?
The _ERROR_ variable has a value of 1 if there is an error in the data for that observation and 0 if there is not.
53. What do the SAS log messages "numeric values have been converted to character" mean? What are the implications?
It implies that automatic conversion took place to make character functions possible.
54. Why is a STOP statement needed for the POINT= option on a SET statement?
Because POINT= reads only the specified observations, SAS cannot detect an end-of-file condition as it would if the file were being read sequentially.
55. How do you control the number of observations and/or variables read or written?
The FIRSTOBS and OBS options.
SAS Questions and Answers Pdf Download
Read the full article
0 notes
t-baba · 8 years ago
Photo
Pandas: The Swiss Army Knife for Your Data, Part 2
This is part two of a two-part tutorial about Pandas, the amazing Python data analytics toolkit. 
In part one, we covered the basic data types of Pandas: the series and the data frame. We imported and exported data, selected subsets of data, worked with metadata, and sorted the data. 
In this part, we'll continue our journey and deal with missing data, data manipulation, data merging, data grouping, time series, and plotting.
Dealing With Missing Values
One of the strongest points of pandas is its handling of missing values. It will not just crash and burn in the presence of missing data. When data is missing, pandas replaces it with numpy's np.nan (not a number), and it doesn't participate in any computation.
Let's reindex our data frame, adding more rows and columns, but without any new data. To make it interesting, we'll populate some values.
>>> df = pd.DataFrame(np.random.randn(5,2), index=index, columns=['a','b']) >>> new_index = df.index.append(pd.Index(['six'])) >>> new_columns = list(df.columns) + ['c'] >>> df = df.reindex(index=new_index, columns=new_columns) >>> df.loc['three'].c = 3 >>> df.loc['four'].c = 4 >>> df a b c one -0.042172 0.374922 NaN two -0.689523 1.411403 NaN three 0.332707 0.307561 3.0 four 0.426519 -0.425181 4.0 five -0.161095 -0.849932 NaN six NaN NaN NaN
Note that df.index.append() returns a new index and doesn't modify the existing index. Also, df.reindex() returns a new data frame that I assign back to the df variable.
At this point, our data frame has six rows. The last row is all NaNs, and all other rows except the third and the fourth have NaN in the "c" column. What can you do with missing data? Here are options:
Keep it (but it will not participate in computations).
Drop it (the result of the computation will not contain the missing data).
Replace it with a default value.
Keep the missing data --------------------- >>> df *= 2 >>> df a b c one -0.084345 0.749845 NaN two -1.379046 2.822806 NaN three 0.665414 0.615123 6.0 four 0.853037 -0.850362 8.0 five -0.322190 -1.699864 NaN six NaN NaN NaN Drop rows with missing data --------------------------- >>> df.dropna() a b c three 0.665414 0.615123 6.0 four 0.853037 -0.850362 8.0 Replace with default value -------------------------- >>> df.fillna(5) a b c one -0.084345 0.749845 5.0 two -1.379046 2.822806 5.0 three 0.665414 0.615123 6.0 four 0.853037 -0.850362 8.0 five -0.322190 -1.699864 5.0 six 5.000000 5.000000 5.0
If you just want to check if you have missing data in your data frame, use the isnull() method. This returns a boolean mask of your dataframe, which is True for missing values and False elsewhere.
>>> df.isnull() a b c one False False True two False False True three False False False four False False False five False False True six True True True
Manipulating Your Data
When you have a data frame, you often need to perform operations on the data. Let's start with a new data frame that has four rows and three columns of random integers between 1 and 9 (inclusive).
>>> df = pd.DataFrame(np.random.randint(1, 10, size=(4, 3)), columns=['a','b', 'c']) >>> df a b c 0 1 3 3 1 8 9 2 2 8 1 5 3 4 6 1
Now, you can start working on the data. Let's sum up all the columns and assign the result to the last row, and then sum all the rows (dimension 1) and assign to the last column:
>>> df.loc[3] = df.sum() >>> df a b c 0 1 3 3 1 8 9 2 2 8 1 5 3 21 19 11 >>> df.c = df.sum(1) >>> df a b c 0 1 3 7 1 8 9 19 2 8 1 14 3 21 19 51
You can also perform operations on the entire data frame. Here is an example of subtracting 3 from each and every cell:
>>> df -= 3 >>> df a b c 0 -2 0 4 1 5 6 16 2 5 -2 11 3 18 16 48
For total control, you can apply arbitrary functions:
>>> df.apply(lambda x: x ** 2 + 5 * x - 4) a b c 0 -10 -4 32 1 46 62 332 2 46 -10 172 3 410 332 2540
Merging Data
Another common scenario when working with data frames is combining and merging data frames (and series) together. Pandas, as usual, gives you different options. Let's create another data frame and explore the various options.
>>> df2 = df // 3 >>> df2 a b c 0 -1 0 1 1 1 2 5 2 1 -1 3 3 6 5 16
Concat
When using pd.concat, pandas simply concatenates all the rows of the provided parts in order. There is no alignment of indexes. See in the following example how duplicate index values are created:
>>> pd.concat([df, df2]) a b c 0 -2 0 4 1 5 6 16 2 5 -2 11 3 18 16 48 0 -1 0 1 1 1 2 5 2 1 -1 3 3 6 5 16
You can also concatenate columns by using the axis=1 argument:
>>> pd.concat([df[:2], df2], axis=1) a b c a b c 0 -2.0 0.0 4.0 -1 0 1 1 5.0 6.0 16.0 1 2 5 2 NaN NaN NaN 1 -1 3 3 NaN NaN NaN 6 5 16
Note that because the first data frame (I used only two rows) didn't have as many rows, the missing values were automatically populated with NaNs, which changed those column types from int to float.
It's possible to concatenate any number of data frames in one call.
Merge
The merge function behaves in a similar way to SQL join. It merges all the columns from rows that have similar keys. Note that it operates on two data frames only:
>>> df = pd.DataFrame(dict(key=['start', 'finish'],x=[4, 8])) >>> df key x 0 start 4 1 finish 8 >>> df2 = pd.DataFrame(dict(key=['start', 'finish'],y=[2, 18])) >>> df2 key y 0 start 2 1 finish 18 >>> pd.merge(df, df2, on='key') key x y 0 start 4 2 1 finish 8 18
Append
The data frame's append() method is a little shortcut. It functionally behaves like concat(), but saves some key strokes.
>>> df key x 0 start 4 1 finish 8 Appending one row using the append method() ------------------------------------------- >>> df.append(dict(key='middle', x=9), ignore_index=True) key x 0 start 4 1 finish 8 2 middle 9 Appending one row using the concat() ------------------------------------------- >>> pd.concat([df, pd.DataFrame(dict(key='middle', x=[9]))], ignore_index=True) key x 0 start 4 1 finish 8 2 middle 9
Grouping Your Data
Here is a data frame that contains the members and ages of two families: the Smiths and the Joneses. You can use the groupby() method to group data by last name and find information at the family level like the sum of ages and the mean age:
df = pd.DataFrame( dict(first='John Jim Jenny Jill Jack'.split(), last='Smith Jones Jones Smith Smith'.split(), age=[11, 13, 22, 44, 65])) >>> df.groupby('last').sum() age last Jones 35 Smith 120 >>> df.groupby('last').mean() age last Jones 17.5 Smith 40.0
Time Series
A lot of important data is time series data. Pandas has strong support for time series data starting with data ranges, going through localization and time conversion, and all the way to sophisticated frequency-based resampling.
The date_range() function can generate sequences of datetimes. Here is an example of generating a six-week period starting on 1 January 2017 using the UTC time zone.
>>> weeks = pd.date_range(start='1/1/2017', periods=6, freq='W', tz='UTC')
>>> weeks
DatetimeIndex(['2017-01-01', '2017-01-08', '2017-01-15', '2017-01-22',
               '2017-01-29', '2017-02-05'],
              dtype='datetime64[ns, UTC]', freq='W-SUN')
Adding a timestamp to your data frames, either as data column or as the index, is great for organizing and grouping your data by time. It also allows resampling. Here is an example of resampling every minute data as five-minute aggregations.
>>> minutes = pd.date_range(start='1/1/2017', periods=10, freq='1Min', tz='UTC')
>>> ts = pd.Series(np.random.randn(len(minutes)), minutes)
>>> ts
2017-01-01 00:00:00+00:00    1.866913
2017-01-01 00:01:00+00:00    2.157201
2017-01-01 00:02:00+00:00   -0.439932
2017-01-01 00:03:00+00:00    0.777944
2017-01-01 00:04:00+00:00    0.755624
2017-01-01 00:05:00+00:00   -2.150276
2017-01-01 00:06:00+00:00    3.352880
2017-01-01 00:07:00+00:00   -1.657432
2017-01-01 00:08:00+00:00   -0.144666
2017-01-01 00:09:00+00:00   -0.667059
Freq: T, dtype: float64
>>> ts.resample('5Min').mean()
2017-01-01 00:00:00+00:00    1.023550
2017-01-01 00:05:00+00:00   -0.253311
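The same five-minute bucketing can be expressed in SQL Server by flooring the timestamp to a multiple of five minutes. This is only a sketch, with an illustrative readings table and ts / reading_value columns, mirroring the resample('5Min').mean() call above:
SELECT
    DATEADD(MINUTE, (DATEDIFF(MINUTE, 0, ts) / 5) * 5, 0) AS bucket_start,
    AVG(reading_value)                                    AS avg_value
FROM readings
GROUP BY DATEADD(MINUTE, (DATEDIFF(MINUTE, 0, ts) / 5) * 5, 0)
ORDER BY bucket_start;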
Plotting
Pandas supports plotting with matplotlib. Make sure it's installed: pip install matplotlib. To generate a plot, you can call the plot() of a series or a data frame. There are many options to control the plot, but the defaults work for simple visualization purposes. Here is how to generate a line graph and save it to a PDF file.
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2017', periods=1000))
ts = ts.cumsum()
ax = ts.plot()
fig = ax.get_figure()
fig.savefig('plot.pdf')
Note that on macOS, Python must be installed as a framework for plotting with Pandas.
Conclusion
Pandas is a very broad data analytics framework. It has a simple object model with the concepts of series and data frame and a wealth of built-in functionality. You can compose and mix pandas functions and your own algorithms. 
Data importing and exporting in pandas are very extensive too and ensure that you can integrate it easily into existing systems. If you're doing any data processing in Python, pandas belongs in your toolbox.
by Gigi Sayfan via Envato Tuts+ Code http://ift.tt/2gaPZ24
2 notes · View notes
alanajacksontx · 6 years ago
Text
Using Python to recover SEO site traffic (Part one)
Helping a client recover from a bad redesign or site migration is probably one of the most critical jobs you can face as an SEO.
The traditional approach of conducting a full forensic SEO audit works well most of the time, but what if there was a way to speed things up? You could potentially save your client a lot of money in opportunity cost.
Last November, I spoke at TechSEO Boost and presented a technique my team and I regularly use to analyze traffic drops. It allows us to pinpoint this painful problem quickly and with surgical precision. As far as I know, there are no tools that currently implement this technique. I coded this solution using Python.
This is the first part of a three-part series. In part two, we will manually group the pages using regular expressions and in part three we will group them automatically using machine learning techniques. Let’s walk over part one and have some fun!
Winners vs losers
Last June we signed up a client that moved from Ecommerce V3 to Shopify and the SEO traffic took a big hit. The owner set up 301 redirects between the old and new sites but made a number of unwise changes like merging a large number of categories and rewriting titles during the move.
When traffic drops, some parts of the site underperform while others don’t. I like to isolate them in order to 1) focus all efforts on the underperforming parts, and 2) learn from the parts that are doing well.
I call this analysis the “Winners vs Losers” analysis. Here, winners are the parts that do well, and losers the ones that do badly.
A visualization of the analysis looks like the chart above. I was able to narrow down the issue to the category pages (Collection pages) and found that the main issue was caused by the site owner merging and eliminating too many categories during the move.
Let’s walk over the steps to put this kind of analysis together in Python.
You can reference my carefully documented Google Colab notebook here.
Getting the data
We want to programmatically compare two separate time frames in Google Analytics (before and after the traffic drop), and we’re going to use the Google Analytics API to do it.
Google Analytics Query Explorer provides the simplest approach to do this in Python.
Head on over to the Google Analytics Query Explorer
Click on the button at the top that says “Click here to Authorize” and follow the steps provided.
Use the dropdown menu to select the website you want to get data from.
Fill in the “metrics” parameter with “ga:newUsers” in order to track new visits.
Complete the “dimensions” parameter with “ga:landingPagePath” in order to get the page URLs.
Fill in the “segment” parameter with “gaid::-5” in order to track organic search visits.
Hit “Run Query” and let it run
Scroll down to the bottom of the page and look for the text box that says “API Query URI.”
Check the box underneath it that says “Include current access_token in the Query URI (will expire in ~60 minutes).”
At the end of the URL in the text box you should now see access_token=string-of-text-here. You will use this string of text in the code snippet below as  the variable called token (make sure to paste it inside the quotes)
Now, scroll back up to where we built the query, and look for the parameter that was filled in for you called “ids.” You will use this in the code snippet below as the variable called “gaid.” Again, it should go inside the quotes.
Run the cell once you’ve filled in the gaid and token variables to instantiate them, and we’re good to go!
First, let’s define placeholder variables to pass to the API
metrics = ",".join(["ga:users", "ga:newUsers"])
dimensions = ",".join(["ga:landingPagePath", "ga:date"])
segment = "gaid::-5"
# Required, please fill in with your own GA information example: ga:23322342
gaid = "ga:23322342"
# Example: string-of-text-here from step 8.2
token = ""
# Example https://www.example.com or http://example.org
base_site_url = ""
# You can change the start and end dates as you like
start = "2017-06-01"
end = "2018-06-30"
The first function combines the placeholder variables we filled in above with an API URL to get Google Analytics data. We make additional API requests and merge them in case the results exceed the 10,000 limit.
import requests  # used for the API calls below

def GAData(gaid, start, end, metrics, dimensions,
           segment, token, max_results=10000):
  """Creates a generator that yields GA API data
     in chunks of size `max_results`"""
  # build uri w/ params
  api_uri = "https://www.googleapis.com/analytics/v3/data/ga?ids={gaid}&"\
             "start-date={start}&end-date={end}&metrics={metrics}&"\
             "dimensions={dimensions}&segment={segment}&access_token={token}&"\
             "max-results={max_results}"
  # insert uri params
  api_uri = api_uri.format(
      gaid=gaid,
      start=start,
      end=end,
      metrics=metrics,
      dimensions=dimensions,
      segment=segment,
      token=token,
      max_results=max_results
  )
  # Using yield to make a generator in an
  # attempt to be memory efficient, since data is downloaded in chunks
  r = requests.get(api_uri)
  data = r.json()
  yield data
  if data.get("nextLink", None):
    while data.get("nextLink"):
      new_uri = data.get("nextLink")
      new_uri += "&access_token={token}".format(token=token)
      r = requests.get(new_uri)
      data = r.json()
      yield data
In the second function, we load the Google Analytics Query Explorer API response into a pandas DataFrame to simplify our analysis.
import pandas as pd

def to_df(gadata):
  """Takes in a generator from GAData()
     creates a dataframe from the rows"""
  df = None
  for data in gadata:
    if df is None:
      df = pd.DataFrame(
          data['rows'],
          columns=[x['name'] for x in data['columnHeaders']]
      )
    else:
      newdf = pd.DataFrame(
          data['rows'],
          columns=[x['name'] for x in data['columnHeaders']]
      )
      df = df.append(newdf)
    print("Gathered {} rows".format(len(df)))
  return df
Now, we can call the functions to load the Google Analytics data.
data = GAData(gaid=gaid, metrics=metrics, start=start,
              end=end, dimensions=dimensions, segment=segment,
              token=token)
data = to_df(data)
Analyzing the data
Let’s start by just getting a look at the data. We’ll use the .head() method of DataFrames to take a look at the first few rows. Think of this as glancing at only the top few rows of an Excel spreadsheet.
data.head(5)
This displays the first five rows of the data frame.
Most of the data is not in the right format for proper analysis, so let’s perform some data transformations.
First, let’s convert the date to a datetime object and the metrics to numeric values.
data['ga:date'] = pd.to_datetime(data['ga:date'])
data['ga:users'] = pd.to_numeric(data['ga:users'])
data['ga:newUsers'] = pd.to_numeric(data['ga:newUsers'])
Next, we will need the landing page URL, which are relative and include URL parameters in two additional formats: 1) as absolute urls, and 2) as relative paths (without the URL parameters).
from urllib.parse import urlparse, urljoin
data['path'] = data['ga:landingPagePath'].apply(lambda x: urlparse(x).path)
data['url'] = urljoin(base_site_url, data['path'])
Now the fun part begins.
The goal of our analysis is to see which pages lost traffic after a particular date–compared to the period before that date–and which gained traffic after that date.
The example date chosen below corresponds to the exact midpoint of our start and end variables used above to gather the data, so that the data both before and after the date is similarly sized.
We begin the analysis by grouping each URL together by their path and adding up the newUsers for each URL. We do this with the built-in pandas method: .groupby(), which takes a column name as an input and groups together each unique value in that column.
The .sum() method then takes the sum of every other column in the data frame within each group.
For more information on these methods please see the Pandas documentation for groupby.
For those who might be familiar with SQL, this is analogous to a GROUP BY clause with a SUM in the select clause
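Roughly speaking, the groupby().sum() pattern just described corresponds to a query like this (the table and column names simply mirror the data frame rather than a real database):
SELECT
    landing_page_path,
    SUM(new_users) AS new_users
FROM ga_data
GROUP BY landing_page_path;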
# Change this depending on your needs
MIDPOINT_DATE = "2017-12-15"
before = data[data['ga:date'] < pd.to_datetime(MIDPOINT_DATE)]
after = data[data['ga:date'] >= pd.to_datetime(MIDPOINT_DATE)]
# Traffic totals before Shopify switch
totals_before = before[["ga:landingPagePath", "ga:newUsers"]]\
                .groupby("ga:landingPagePath").sum()
totals_before = totals_before.reset_index()\
                .sort_values("ga:newUsers", ascending=False)
# Traffic totals after Shopify switch
totals_after = after[["ga:landingPagePath", "ga:newUsers"]]\
               .groupby("ga:landingPagePath").sum()
totals_after = totals_after.reset_index()\
               .sort_values("ga:newUsers", ascending=False)
You can check the totals before and after with this code and double check with the Google Analytics numbers.
print("Traffic Totals Before: ")
print("Row count: ", len(totals_before))
print("Traffic Totals After: ")
print("Row count: ", len(totals_after))
Next up we merge the two data frames, so that we have a single column corresponding to the URL, and two columns corresponding to the totals before and after the date.
We have different options when merging as illustrated above. Here, we use an “outer” merge, because even if a URL didn’t show up in the “before” period, we still want it to be a part of this merged dataframe. We’ll fill in the blanks with zeros after the merge.
# Comparing pages from before and after the switch
change = totals_after.merge(totals_before,
                            left_on="ga:landingPagePath",
                            right_on="ga:landingPagePath",
                            suffixes=["_after", "_before"],
                            how="outer")
change.fillna(0, inplace=True)
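For readers keeping the SQL analogy going, this "outer" merge plus fillna(0) behaves much like a FULL OUTER JOIN with COALESCE (the table names here mirror the data frames, not a real database):
SELECT
    COALESCE(a.landing_page_path, b.landing_page_path) AS landing_page_path,
    COALESCE(a.new_users, 0) AS new_users_after,
    COALESCE(b.new_users, 0) AS new_users_before
FROM totals_after AS a
FULL OUTER JOIN totals_before AS b
    ON a.landing_page_path = b.landing_page_path;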
Difference and percentage change
Pandas dataframes make simple calculations on whole columns easy. We can take the difference of two columns and divide two columns and it will perform that operation on every row for us. We will take the difference of the two totals columns, and divide by the “before” column to get the percent change before and after out midpoint date.
Using this percent_change column we can then filter our dataframe to get the winners, the losers and those URLs with no change.
change['difference'] = change['ga:newUsers_after'] - change['ga:newUsers_before']
change['percent_change'] = change['difference'] / change['ga:newUsers_before']
winners = change[change['percent_change'] > 0]
losers = change[change['percent_change'] < 0]
no_change = change[change['percent_change'] == 0]
Sanity check
Finally, we do a quick sanity check to make sure that all the traffic from the original data frame is still accounted for after all of our analysis. To do this, we simply take the sum of all traffic for both the original data frame and the two columns of our change dataframe.
# Checking that the total traffic adds up
data['ga:newUsers'].sum() == change[['ga:newUsers_after', 'ga:newUsers_before']].sum().sum()
It should be True.
Results
Sorting by the difference in our losers data frame, and taking the .head(10), we can see the top 10 losers in our analysis. In other words, these pages lost the most total traffic between the two periods before and after the midpoint date.
losers.sort_values("difference").head(10)
You can do the same to review the winners and try to learn from them.
winners.sort_values("difference", ascending=False).head(10)
You can export the losing pages to a CSV or Excel using this.
losers.to_csv("./losing-pages.csv")
This seems like a lot of work to analyze just one site–and it is!
The magic happens when you reuse this code on new clients and simply need to replace the placeholder variables at the top of the script.
In part two, we will make the output more useful by grouping the losing (and winning) pages by their types to get the chart I included above.
The post Using Python to recover SEO site traffic (Part one) appeared first on Search Engine Watch.
from IM Tips And Tricks https://searchenginewatch.com/2019/02/06/using-python-to-recover-seo-site-traffic-part-one/ from Rising Phoenix SEO https://risingphxseo.tumblr.com/post/182759232745
0 notes
      start=start,
      end=end,
      metrics=metrics,
      dimensions=dimensions,
      segment=segment,
      token=token,
      max_results=max_results
  )
  # Using yield to make a generator in an
  # attempt to be memory efficient, since data is downloaded in chunks
  r = requests.get(api_uri)
  data = r.json()
  yield data
  if data.get(“nextLink”, None):
    while data.get(“nextLink”):
      new_uri = data.get(“nextLink”)
      new_uri += “&access_token={token}”.format(token=token)
      r = requests.get(new_uri)
      data = r.json()
      yield data
In the second function, we load the Google Analytics Query Explorer API response into a pandas DataFrame to simplify our analysis.
import pandas as pd
def to_df(gadata):
  “””Takes in a generator from GAData() 
     creates a dataframe from the rows”””
  df = None
  for data in gadata:
    if df is None:
      df = pd.DataFrame(
          data[‘rows’], 
          columns=[x[‘name’] for x in data[‘columnHeaders’]]
      )
    else:
      newdf = pd.DataFrame(
          data[‘rows’], 
          columns=[x[‘name’] for x in data[‘columnHeaders’]]
      )
      df = df.append(newdf)
    print(“Gathered {} rows”.format(len(df)))
  return df
Now, we can call the functions to load the Google Analytics data.
data = GAData(gaid=gaid, metrics=metrics, start=start, 
                end=end, dimensions=dimensions, segment=segment, 
                token=token)
data = to_df(data)
Analyzing the data
Let’s start by just getting a look at the data. We’ll use the .head() method of DataFrames to take a look at the first few rows. Think of this as glancing at only the top few rows of an Excel spreadsheet.
data.head(5)
This displays the first five rows of the data frame.
Most of the data is not in the right format for proper analysis, so let’s perform some data transformations.
First, let’s convert the date to a datetime object and the metrics to numeric values.
data[‘ga:date’] = pd.to_datetime(data[‘ga:date’])
data[‘ga:users’] = pd.to_numeric(data[‘ga:users’])
data[‘ga:newUsers’] = pd.to_numeric(data[‘ga:newUsers’])
Next, we will need the landing page URL, which are relative and include URL parameters in two additional formats: 1) as absolute urls, and 2) as relative paths (without the URL parameters).
from urllib.parse import urlparse, urljoin
data[‘path’] = data[‘ga:landingPagePath’].apply(lambda x: urlparse(x).path)
data[‘url’] = urljoin(base_site_url, data[‘path’])
Now the fun part begins.
The goal of our analysis is to see which pages lost traffic after a particular date–compared to the period before that date–and which gained traffic after that date.
The example date chosen below corresponds to the exact midpoint of our start and end variables used above to gather the data, so that the data both before and after the date is similarly sized.
We begin the analysis by grouping each URL together by their path and adding up the newUsers for each URL. We do this with the built-in pandas method: .groupby(), which takes a column name as an input and groups together each unique value in that column.
The .sum() method then takes the sum of every other column in the data frame within each group.
For more information on these methods please see the Pandas documentation for groupby.
For those who might be familiar with SQL, this is analogous to a GROUP BY clause with a SUM in the select clause
# Change this depending on your needs
MIDPOINT_DATE = “2017-12-15”
before = data[data[‘ga:date’] < pd.to_datetime(MIDPOINT_DATE)]
after = data[data[‘ga:date’] >= pd.to_datetime(MIDPOINT_DATE)]
# Traffic totals before Shopify switch
totals_before = before[[“ga:landingPagePath”, “ga:newUsers”]]\
                .groupby(“ga:landingPagePath”).sum()
totals_before = totals_before.reset_index()\
                .sort_values(“ga:newUsers”, ascending=False)
# Traffic totals after Shopify switch
totals_after = after[[“ga:landingPagePath”, “ga:newUsers”]]\
               .groupby(“ga:landingPagePath”).sum()
totals_after = totals_after.reset_index()\
               .sort_values(“ga:newUsers”, ascending=False)
You can check the totals before and after with this code and double check with the Google Analytics numbers.
print(“Traffic Totals Before: “)
print(“Row count: “, len(totals_before))
print(“Traffic Totals After: “)
print(“Row count: “, len(totals_after))
Next up we merge the two data frames, so that we have a single column corresponding to the URL, and two columns corresponding to the totals before and after the date.
We have different options when merging as illustrated above. Here, we use an “outer” merge, because even if a URL didn’t show up in the “before” period, we still want it to be a part of this merged dataframe. We’ll fill in the blanks with zeros after the merge.
# Comparing pages from before and after the switch
change = totals_after.merge(totals_before, 
                            left_on=”ga:landingPagePath”, 
                            right_on=”ga:landingPagePath”, 
                            suffixes=[“_after”, “_before”], 
                            how=”outer”)
change.fillna(0, inplace=True)
Difference and percentage change
Pandas dataframes make simple calculations on whole columns easy. We can take the difference of two columns and divide two columns and it will perform that operation on every row for us. We will take the difference of the two totals columns, and divide by the “before” column to get the percent change before and after out midpoint date.
Using this percent_change column we can then filter our dataframe to get the winners, the losers and those URLs with no change.
change[‘difference’] = change[‘ga:newUsers_after’] – change[‘ga:newUsers_before’]
change[‘percent_change’] = change[‘difference’] / change[‘ga:newUsers_before’]
winners = change[change[‘percent_change’] > 0]
losers = change[change[‘percent_change’] < 0]
no_change = change[change[‘percent_change’] == 0]
Sanity check
Finally, we do a quick sanity check to make sure that all the traffic from the original data frame is still accounted for after all of our analysis. To do this, we simply take the sum of all traffic for both the original data frame and the two columns of our change dataframe.
# Checking that the total traffic adds up
data[‘ga:newUsers’].sum() == change[[‘ga:newUsers_after’, ‘ga:newUsers_before’]].sum().sum()
It should be True.
Results
Sorting by the difference in our losers data frame, and taking the .head(10), we can see the top 10 losers in our analysis. In other words, these pages lost the most total traffic between the two periods before and after the midpoint date.
losers.sort_values(“difference”).head(10)
You can do the same to review the winners and try to learn from them.
winners.sort_values(“difference”, ascending=False).head(10)
You can export the losing pages to a CSV or Excel using this.
losers.to_csv(“./losing-pages.csv”)
This seems like a lot of work to analyze just one site–and it is!
The magic happens when you reuse this code on new clients and simply need to replace the placeholder variables at the top of the script.
In part two, we will make the output more useful by grouping the losing (and winning) pages by their types to get the chart I included above.
The post Using Python to recover SEO site traffic (Part one) appeared first on Search Engine Watch.
source https://searchenginewatch.com/2019/02/06/using-python-to-recover-seo-site-traffic-part-one/ from Rising Phoenix SEO http://risingphoenixseo.blogspot.com/2019/02/using-python-to-recover-seo-site.html
0 notes
evaaguilaus · 6 years ago
Text
Using Python to recover SEO site traffic (Part one)
Helping a client recover from a bad redesign or site migration is probably one of the most critical jobs you can face as an SEO.
The traditional approach of conducting a full forensic SEO audit works well most of the time, but what if there was a way to speed things up? You could potentially save your client a lot of money in opportunity cost.
Last November, I spoke at TechSEO Boost and presented a technique my team and I regularly use to analyze traffic drops. It allows us to pinpoint this painful problem quickly and with surgical precision. As far as I know, there are no tools that currently implement this technique. I coded this solution using Python.
This is the first part of a three-part series. In part two, we will manually group the pages using regular expressions and in part three we will group them automatically using machine learning techniques. Let’s walk over part one and have some fun!
Winners vs losers
Last June we signed up a client that moved from Ecommerce V3 to Shopify and the SEO traffic took a big hit. The owner set up 301 redirects between the old and new sites but made a number of unwise changes like merging a large number of categories and rewriting titles during the move.
When traffic drops, some parts of the site underperform while others don’t. I like to isolate them in order to 1) focus all efforts on the underperforming parts, and 2) learn from the parts that are doing well.
I call this analysis the “Winners vs Losers” analysis. Here, winners are the parts that do well, and losers the ones that do badly.
A visualization of the analysis looks like the chart above. I was able to narrow down the issue to the category pages (Collection pages) and found that the main issue was caused by the site owner merging and eliminating too many categories during the move.
Let’s walk over the steps to put this kind of analysis together in Python.
You can reference my carefully documented Google Colab notebook here.
Getting the data
We want to programmatically compare two separate time frames in Google Analytics (before and after the traffic drop), and we’re going to use the Google Analytics API to do it.
Google Analytics Query Explorer provides the simplest approach to do this in Python.
Head on over to the Google Analytics Query Explorer
Click on the button at the top that says “Click here to Authorize” and follow the steps provided.
Use the dropdown menu to select the website you want to get data from.
Fill in the “metrics” parameter with “ga:newUsers” in order to track new visits.
Complete the “dimensions” parameter with “ga:landingPagePath” in order to get the page URLs.
Fill in the “segment” parameter with “gaid::-5” in order to track organic search visits.
Hit “Run Query” and let it run
Scroll down to the bottom of the page and look for the text box that says “API Query URI.”
Check the box underneath it that says “Include current access_token in the Query URI (will expire in ~60 minutes).”
At the end of the URL in the text box you should now see access_token=string-of-text-here. You will use this string of text in the code snippet below as  the variable called token (make sure to paste it inside the quotes)
Now, scroll back up to where we built the query, and look for the parameter that was filled in for you called “ids.” You will use this in the code snippet below as the variable called “gaid.” Again, it should go inside the quotes.
Run the cell once you’ve filled in the gaid and token variables to instantiate them, and we’re good to go!
First, let’s define placeholder variables to pass to the API
metrics = “,”.join([“ga:users”,”ga:newUsers”])
dimensions = “,”.join([“ga:landingPagePath”, “ga:date”])
segment = “gaid::-5”
# Required, please fill in with your own GA information example: ga:23322342
gaid = “ga:23322342”
# Example: string-of-text-here from step 8.2
token = “”
# Example https://www.example.com or http://example.org
base_site_url = “”
# You can change the start and end dates as you like
start = “2017-06-01”
end = “2018-06-30”
The first function combines the placeholder variables we filled in above with an API URL to get Google Analytics data. We make additional API requests and merge them in case the results exceed the 10,000 limit.
def GAData(gaid, start, end, metrics, dimensions, 
           segment, token, max_results=10000):
  “””Creates a generator that yields GA API data 
     in chunks of size `max_results`”””
  #build uri w/ params
  api_uri = “https://www.googleapis.com/analytics/v3/data/ga?ids={gaid}&”\
             “start-date={start}&end-date={end}&metrics={metrics}&”\
             “dimensions={dimensions}&segment={segment}&access_token={token}&”\
             “max-results={max_results}”
  # insert uri params
  api_uri = api_uri.format(
      gaid=gaid,
      start=start,
      end=end,
      metrics=metrics,
      dimensions=dimensions,
      segment=segment,
      token=token,
      max_results=max_results
  )
  # Using yield to make a generator in an
  # attempt to be memory efficient, since data is downloaded in chunks
  r = requests.get(api_uri)
  data = r.json()
  yield data
  if data.get(“nextLink”, None):
    while data.get(“nextLink”):
      new_uri = data.get(“nextLink”)
      new_uri += “&access_token={token}”.format(token=token)
      r = requests.get(new_uri)
      data = r.json()
      yield data
In the second function, we load the Google Analytics Query Explorer API response into a pandas DataFrame to simplify our analysis.
import pandas as pd
def to_df(gadata):
  “””Takes in a generator from GAData() 
     creates a dataframe from the rows”””
  df = None
  for data in gadata:
    if df is None:
      df = pd.DataFrame(
          data[‘rows’], 
          columns=[x[‘name’] for x in data[‘columnHeaders’]]
      )
    else:
      newdf = pd.DataFrame(
          data[‘rows’], 
          columns=[x[‘name’] for x in data[‘columnHeaders’]]
      )
      df = df.append(newdf)
    print(“Gathered {} rows”.format(len(df)))
  return df
Now, we can call the functions to load the Google Analytics data.
data = GAData(gaid=gaid, metrics=metrics, start=start, 
                end=end, dimensions=dimensions, segment=segment, 
                token=token)
data = to_df(data)
Analyzing the data
Let’s start by just getting a look at the data. We’ll use the .head() method of DataFrames to take a look at the first few rows. Think of this as glancing at only the top few rows of an Excel spreadsheet.
data.head(5)
This displays the first five rows of the data frame.
Most of the data is not in the right format for proper analysis, so let’s perform some data transformations.
First, let’s convert the date to a datetime object and the metrics to numeric values.
data[‘ga:date’] = pd.to_datetime(data[‘ga:date’])
data[‘ga:users’] = pd.to_numeric(data[‘ga:users’])
data[‘ga:newUsers’] = pd.to_numeric(data[‘ga:newUsers’])
Next, we will need the landing page URL, which are relative and include URL parameters in two additional formats: 1) as absolute urls, and 2) as relative paths (without the URL parameters).
from urllib.parse import urlparse, urljoin
data[‘path’] = data[‘ga:landingPagePath’].apply(lambda x: urlparse(x).path)
data[‘url’] = urljoin(base_site_url, data[‘path’])
Now the fun part begins.
The goal of our analysis is to see which pages lost traffic after a particular date–compared to the period before that date–and which gained traffic after that date.
The example date chosen below corresponds to the exact midpoint of our start and end variables used above to gather the data, so that the data both before and after the date is similarly sized.
We begin the analysis by grouping each URL together by their path and adding up the newUsers for each URL. We do this with the built-in pandas method: .groupby(), which takes a column name as an input and groups together each unique value in that column.
The .sum() method then takes the sum of every other column in the data frame within each group.
For more information on these methods please see the Pandas documentation for groupby.
For those who might be familiar with SQL, this is analogous to a GROUP BY clause with a SUM in the select clause
# Change this depending on your needs
MIDPOINT_DATE = “2017-12-15”
before = data[data[‘ga:date’] < pd.to_datetime(MIDPOINT_DATE)]
after = data[data[‘ga:date’] >= pd.to_datetime(MIDPOINT_DATE)]
# Traffic totals before Shopify switch
totals_before = before[[“ga:landingPagePath”, “ga:newUsers”]]\
                .groupby(“ga:landingPagePath”).sum()
totals_before = totals_before.reset_index()\
                .sort_values(“ga:newUsers”, ascending=False)
# Traffic totals after Shopify switch
totals_after = after[[“ga:landingPagePath”, “ga:newUsers”]]\
               .groupby(“ga:landingPagePath”).sum()
totals_after = totals_after.reset_index()\
               .sort_values(“ga:newUsers”, ascending=False)
You can check the totals before and after with this code and double check with the Google Analytics numbers.
print(“Traffic Totals Before: “)
print(“Row count: “, len(totals_before))
print(“Traffic Totals After: “)
print(“Row count: “, len(totals_after))
Next up we merge the two data frames, so that we have a single column corresponding to the URL, and two columns corresponding to the totals before and after the date.
We have different options when merging as illustrated above. Here, we use an “outer” merge, because even if a URL didn’t show up in the “before” period, we still want it to be a part of this merged dataframe. We’ll fill in the blanks with zeros after the merge.
# Comparing pages from before and after the switch
change = totals_after.merge(totals_before, 
                            left_on=”ga:landingPagePath”, 
                            right_on=”ga:landingPagePath”, 
                            suffixes=[“_after”, “_before”], 
                            how=”outer”)
change.fillna(0, inplace=True)
Difference and percentage change
Pandas dataframes make simple calculations on whole columns easy. We can take the difference of two columns and divide two columns and it will perform that operation on every row for us. We will take the difference of the two totals columns, and divide by the “before” column to get the percent change before and after out midpoint date.
Using this percent_change column we can then filter our dataframe to get the winners, the losers and those URLs with no change.
change[‘difference’] = change[‘ga:newUsers_after’] – change[‘ga:newUsers_before’]
change[‘percent_change’] = change[‘difference’] / change[‘ga:newUsers_before’]
winners = change[change[‘percent_change’] > 0]
losers = change[change[‘percent_change’] < 0]
no_change = change[change[‘percent_change’] == 0]
Sanity check
Finally, we do a quick sanity check to make sure that all the traffic from the original data frame is still accounted for after all of our analysis. To do this, we simply take the sum of all traffic for both the original data frame and the two columns of our change dataframe.
# Checking that the total traffic adds up
data[‘ga:newUsers’].sum() == change[[‘ga:newUsers_after’, ‘ga:newUsers_before’]].sum().sum()
It should be True.
Results
Sorting by the difference in our losers data frame, and taking the .head(10), we can see the top 10 losers in our analysis. In other words, these pages lost the most total traffic between the two periods before and after the midpoint date.
losers.sort_values(“difference”).head(10)
You can do the same to review the winners and try to learn from them.
winners.sort_values(“difference”, ascending=False).head(10)
You can export the losing pages to a CSV or Excel using this.
losers.to_csv(“./losing-pages.csv”)
This seems like a lot of work to analyze just one site–and it is!
The magic happens when you reuse this code on new clients and simply need to replace the placeholder variables at the top of the script.
In part two, we will make the output more useful by grouping the losing (and winning) pages by their types to get the chart I included above.
The post Using Python to recover SEO site traffic (Part one) appeared first on Search Engine Watch.
from Digtal Marketing News https://searchenginewatch.com/2019/02/06/using-python-to-recover-seo-site-traffic-part-one/
0 notes
srasamua · 6 years ago
Text
Using Python to recover SEO site traffic (Part one)
Helping a client recover from a bad redesign or site migration is probably one of the most critical jobs you can face as an SEO.
The traditional approach of conducting a full forensic SEO audit works well most of the time, but what if there was a way to speed things up? You could potentially save your client a lot of money in opportunity cost.
Last November, I spoke at TechSEO Boost and presented a technique my team and I regularly use to analyze traffic drops. It allows us to pinpoint this painful problem quickly and with surgical precision. As far as I know, there are no tools that currently implement this technique. I coded this solution using Python.
This is the first part of a three-part series. In part two, we will manually group the pages using regular expressions and in part three we will group them automatically using machine learning techniques. Let’s walk over part one and have some fun!
Winners vs losers
Last June we signed up a client that moved from Ecommerce V3 to Shopify and the SEO traffic took a big hit. The owner set up 301 redirects between the old and new sites but made a number of unwise changes like merging a large number of categories and rewriting titles during the move.
When traffic drops, some parts of the site underperform while others don’t. I like to isolate them in order to 1) focus all efforts on the underperforming parts, and 2) learn from the parts that are doing well.
I call this analysis the “Winners vs Losers” analysis. Here, winners are the parts that do well, and losers the ones that do badly.
A visualization of the analysis looks like the chart above. I was able to narrow down the issue to the category pages (Collection pages) and found that the main issue was caused by the site owner merging and eliminating too many categories during the move.
Let’s walk over the steps to put this kind of analysis together in Python.
You can reference my carefully documented Google Colab notebook here.
Getting the data
We want to programmatically compare two separate time frames in Google Analytics (before and after the traffic drop), and we’re going to use the Google Analytics API to do it.
Google Analytics Query Explorer provides the simplest approach to do this in Python.
Head on over to the Google Analytics Query Explorer
Click on the button at the top that says “Click here to Authorize” and follow the steps provided.
Use the dropdown menu to select the website you want to get data from.
Fill in the “metrics” parameter with “ga:newUsers” in order to track new visits.
Complete the “dimensions” parameter with “ga:landingPagePath” in order to get the page URLs.
Fill in the “segment” parameter with “gaid::-5” in order to track organic search visits.
Hit “Run Query” and let it run
Scroll down to the bottom of the page and look for the text box that says “API Query URI.”
Check the box underneath it that says “Include current access_token in the Query URI (will expire in ~60 minutes).”
At the end of the URL in the text box you should now see access_token=string-of-text-here. You will use this string of text in the code snippet below as  the variable called token (make sure to paste it inside the quotes)
Now, scroll back up to where we built the query, and look for the parameter that was filled in for you called “ids.” You will use this in the code snippet below as the variable called “gaid.” Again, it should go inside the quotes.
Run the cell once you’ve filled in the gaid and token variables to instantiate them, and we’re good to go!
First, let’s define placeholder variables to pass to the API
metrics = “,”.join([“ga:users”,”ga:newUsers”])
dimensions = “,”.join([“ga:landingPagePath”, “ga:date”])
segment = “gaid::-5”
# Required, please fill in with your own GA information example: ga:23322342
gaid = “ga:23322342”
# Example: string-of-text-here from step 8.2
token = “”
# Example https://www.example.com or http://example.org
base_site_url = “”
# You can change the start and end dates as you like
start = “2017-06-01”
end = “2018-06-30”
The first function combines the placeholder variables we filled in above with an API URL to get Google Analytics data. We make additional API requests and merge them in case the results exceed the 10,000 limit.
def GAData(gaid, start, end, metrics, dimensions, 
           segment, token, max_results=10000):
  “””Creates a generator that yields GA API data 
     in chunks of size `max_results`”””
  #build uri w/ params
  api_uri = “https://www.googleapis.com/analytics/v3/data/ga?ids={gaid}&”\
             “start-date={start}&end-date={end}&metrics={metrics}&”\
             “dimensions={dimensions}&segment={segment}&access_token={token}&”\
             “max-results={max_results}”
  # insert uri params
  api_uri = api_uri.format(
      gaid=gaid,
      start=start,
      end=end,
      metrics=metrics,
      dimensions=dimensions,
      segment=segment,
      token=token,
      max_results=max_results
  )
  # Using yield to make a generator in an
  # attempt to be memory efficient, since data is downloaded in chunks
  r = requests.get(api_uri)
  data = r.json()
  yield data
  if data.get(“nextLink”, None):
    while data.get(“nextLink”):
      new_uri = data.get(“nextLink”)
      new_uri += “&access_token={token}”.format(token=token)
      r = requests.get(new_uri)
      data = r.json()
      yield data
In the second function, we load the Google Analytics Query Explorer API response into a pandas DataFrame to simplify our analysis.
import pandas as pd
def to_df(gadata):
  “””Takes in a generator from GAData() 
     creates a dataframe from the rows”””
  df = None
  for data in gadata:
    if df is None:
      df = pd.DataFrame(
          data[‘rows’], 
          columns=[x[‘name’] for x in data[‘columnHeaders’]]
      )
    else:
      newdf = pd.DataFrame(
          data[‘rows’], 
          columns=[x[‘name’] for x in data[‘columnHeaders’]]
      )
      df = df.append(newdf)
    print(“Gathered {} rows”.format(len(df)))
  return df
Now, we can call the functions to load the Google Analytics data.
data = GAData(gaid=gaid, metrics=metrics, start=start, 
                end=end, dimensions=dimensions, segment=segment, 
                token=token)
data = to_df(data)
Analyzing the data
Let’s start by just getting a look at the data. We’ll use the .head() method of DataFrames to take a look at the first few rows. Think of this as glancing at only the top few rows of an Excel spreadsheet.
data.head(5)
This displays the first five rows of the data frame.
Most of the data is not in the right format for proper analysis, so let’s perform some data transformations.
First, let’s convert the date to a datetime object and the metrics to numeric values.
data[‘ga:date’] = pd.to_datetime(data[‘ga:date’])
data[‘ga:users’] = pd.to_numeric(data[‘ga:users’])
data[‘ga:newUsers’] = pd.to_numeric(data[‘ga:newUsers’])
Next, we will need the landing page URL, which are relative and include URL parameters in two additional formats: 1) as absolute urls, and 2) as relative paths (without the URL parameters).
from urllib.parse import urlparse, urljoin
data[‘path’] = data[‘ga:landingPagePath’].apply(lambda x: urlparse(x).path)
data[‘url’] = urljoin(base_site_url, data[‘path’])
Now the fun part begins.
The goal of our analysis is to see which pages lost traffic after a particular date–compared to the period before that date–and which gained traffic after that date.
The example date chosen below corresponds to the exact midpoint of our start and end variables used above to gather the data, so that the data both before and after the date is similarly sized.
We begin the analysis by grouping each URL together by their path and adding up the newUsers for each URL. We do this with the built-in pandas method: .groupby(), which takes a column name as an input and groups together each unique value in that column.
The .sum() method then takes the sum of every other column in the data frame within each group.
For more information on these methods please see the Pandas documentation for groupby.
For those who might be familiar with SQL, this is analogous to a GROUP BY clause with a SUM in the select clause
# Change this depending on your needs
MIDPOINT_DATE = “2017-12-15”
before = data[data[‘ga:date’] < pd.to_datetime(MIDPOINT_DATE)]
after = data[data[‘ga:date’] >= pd.to_datetime(MIDPOINT_DATE)]
# Traffic totals before Shopify switch
totals_before = before[[“ga:landingPagePath”, “ga:newUsers”]]\
                .groupby(“ga:landingPagePath”).sum()
totals_before = totals_before.reset_index()\
                .sort_values(“ga:newUsers”, ascending=False)
# Traffic totals after Shopify switch
totals_after = after[[“ga:landingPagePath”, “ga:newUsers”]]\
               .groupby(“ga:landingPagePath”).sum()
totals_after = totals_after.reset_index()\
               .sort_values(“ga:newUsers”, ascending=False)
You can check the totals before and after with this code and double check with the Google Analytics numbers.
print(“Traffic Totals Before: “)
print(“Row count: “, len(totals_before))
print(“Traffic Totals After: “)
print(“Row count: “, len(totals_after))
Next up we merge the two data frames, so that we have a single column corresponding to the URL, and two columns corresponding to the totals before and after the date.
We have different options when merging as illustrated above. Here, we use an “outer” merge, because even if a URL didn’t show up in the “before” period, we still want it to be a part of this merged dataframe. We’ll fill in the blanks with zeros after the merge.
# Comparing pages from before and after the switch
change = totals_after.merge(totals_before, 
                            left_on=”ga:landingPagePath”, 
                            right_on=”ga:landingPagePath”, 
                            suffixes=[“_after”, “_before”], 
                            how=”outer”)
change.fillna(0, inplace=True)
Difference and percentage change
Pandas dataframes make simple calculations on whole columns easy. We can take the difference of two columns and divide two columns and it will perform that operation on every row for us. We will take the difference of the two totals columns, and divide by the “before” column to get the percent change before and after out midpoint date.
Using this percent_change column we can then filter our dataframe to get the winners, the losers and those URLs with no change.
change[‘difference’] = change[‘ga:newUsers_after’] – change[‘ga:newUsers_before’]
change[‘percent_change’] = change[‘difference’] / change[‘ga:newUsers_before’]
winners = change[change[‘percent_change’] > 0]
losers = change[change[‘percent_change’] < 0]
no_change = change[change[‘percent_change’] == 0]
Sanity check
Finally, we do a quick sanity check to make sure that all the traffic from the original data frame is still accounted for after all of our analysis. To do this, we simply take the sum of all traffic for both the original data frame and the two columns of our change dataframe.
# Checking that the total traffic adds up
data[‘ga:newUsers’].sum() == change[[‘ga:newUsers_after’, ‘ga:newUsers_before’]].sum().sum()
It should be True.
Results
Sorting by the difference in our losers data frame, and taking the .head(10), we can see the top 10 losers in our analysis. In other words, these pages lost the most total traffic between the two periods before and after the midpoint date.
losers.sort_values(“difference”).head(10)
You can do the same to review the winners and try to learn from them.
winners.sort_values(“difference”, ascending=False).head(10)
You can export the losing pages to a CSV or Excel using this.
losers.to_csv(“./losing-pages.csv”)
This seems like a lot of work to analyze just one site–and it is!
The magic happens when you reuse this code on new clients and simply need to replace the placeholder variables at the top of the script.
In part two, we will make the output more useful by grouping the losing (and winning) pages by their types to get the chart I included above.
The post Using Python to recover SEO site traffic (Part one) appeared first on Search Engine Watch.
from Digtal Marketing News https://searchenginewatch.com/2019/02/06/using-python-to-recover-seo-site-traffic-part-one/
0 notes
bambiguertinus · 6 years ago
Text
Using Python to recover SEO site traffic (Part one)
Helping a client recover from a bad redesign or site migration is probably one of the most critical jobs you can face as an SEO.
The traditional approach of conducting a full forensic SEO audit works well most of the time, but what if there was a way to speed things up? You could potentially save your client a lot of money in opportunity cost.
Last November, I spoke at TechSEO Boost and presented a technique my team and I regularly use to analyze traffic drops. It allows us to pinpoint this painful problem quickly and with surgical precision. As far as I know, there are no tools that currently implement this technique. I coded this solution using Python.
This is the first part of a three-part series. In part two, we will manually group the pages using regular expressions and in part three we will group them automatically using machine learning techniques. Let’s walk over part one and have some fun!
Winners vs losers
Last June we signed up a client that moved from Ecommerce V3 to Shopify and the SEO traffic took a big hit. The owner set up 301 redirects between the old and new sites but made a number of unwise changes like merging a large number of categories and rewriting titles during the move.
When traffic drops, some parts of the site underperform while others don’t. I like to isolate them in order to 1) focus all efforts on the underperforming parts, and 2) learn from the parts that are doing well.
I call this analysis the “Winners vs Losers” analysis. Here, winners are the parts that do well, and losers the ones that do badly.
A visualization of the analysis looks like the chart above. I was able to narrow down the issue to the category pages (Collection pages) and found that the main issue was caused by the site owner merging and eliminating too many categories during the move.
Let’s walk over the steps to put this kind of analysis together in Python.
You can reference my carefully documented Google Colab notebook here.
Getting the data
We want to programmatically compare two separate time frames in Google Analytics (before and after the traffic drop), and we’re going to use the Google Analytics API to do it.
Google Analytics Query Explorer provides the simplest approach to do this in Python.
Head on over to the Google Analytics Query Explorer
Click on the button at the top that says “Click here to Authorize” and follow the steps provided.
Use the dropdown menu to select the website you want to get data from.
Fill in the “metrics” parameter with “ga:newUsers” in order to track new visits.
Complete the “dimensions” parameter with “ga:landingPagePath” in order to get the page URLs.
Fill in the “segment” parameter with “gaid::-5” in order to track organic search visits.
Hit “Run Query” and let it run
Scroll down to the bottom of the page and look for the text box that says “API Query URI.”
Check the box underneath it that says “Include current access_token in the Query URI (will expire in ~60 minutes).”
At the end of the URL in the text box you should now see access_token=string-of-text-here. You will use this string of text in the code snippet below as  the variable called token (make sure to paste it inside the quotes)
Now, scroll back up to where we built the query, and look for the parameter that was filled in for you called “ids.” You will use this in the code snippet below as the variable called “gaid.” Again, it should go inside the quotes.
Run the cell once you’ve filled in the gaid and token variables to instantiate them, and we’re good to go!
First, let’s define placeholder variables to pass to the API
metrics = “,”.join([“ga:users”,”ga:newUsers”])
dimensions = “,”.join([“ga:landingPagePath”, “ga:date”])
segment = “gaid::-5”
# Required, please fill in with your own GA information example: ga:23322342
gaid = “ga:23322342”
# Example: string-of-text-here from step 8.2
token = “”
# Example https://www.example.com or http://example.org
base_site_url = “”
# You can change the start and end dates as you like
start = “2017-06-01”
end = “2018-06-30”
The first function combines the placeholder variables we filled in above with an API URL to get Google Analytics data. We make additional API requests and merge them in case the results exceed the 10,000 limit.
def GAData(gaid, start, end, metrics, dimensions, 
           segment, token, max_results=10000):
  “””Creates a generator that yields GA API data 
     in chunks of size `max_results`”””
  #build uri w/ params
  api_uri = “https://www.googleapis.com/analytics/v3/data/ga?ids={gaid}&”\
             “start-date={start}&end-date={end}&metrics={metrics}&”\
             “dimensions={dimensions}&segment={segment}&access_token={token}&”\
             “max-results={max_results}”
  # insert uri params
  api_uri = api_uri.format(
      gaid=gaid,
      start=start,
      end=end,
      metrics=metrics,
      dimensions=dimensions,
      segment=segment,
      token=token,
      max_results=max_results
  )
  # Using yield to make a generator in an
  # attempt to be memory efficient, since data is downloaded in chunks
  r = requests.get(api_uri)
  data = r.json()
  yield data
  if data.get(“nextLink”, None):
    while data.get(“nextLink”):
      new_uri = data.get(“nextLink”)
      new_uri += “&access_token={token}”.format(token=token)
      r = requests.get(new_uri)
      data = r.json()
      yield data
In the second function, we load the Google Analytics Query Explorer API response into a pandas DataFrame to simplify our analysis.
import pandas as pd
def to_df(gadata):
  “””Takes in a generator from GAData() 
     creates a dataframe from the rows”””
  df = None
  for data in gadata:
    if df is None:
      df = pd.DataFrame(
          data[‘rows’], 
          columns=[x[‘name’] for x in data[‘columnHeaders’]]
      )
    else:
      newdf = pd.DataFrame(
          data[‘rows’], 
          columns=[x[‘name’] for x in data[‘columnHeaders’]]
      )
      df = df.append(newdf)
    print(“Gathered {} rows”.format(len(df)))
  return df
Now, we can call the functions to load the Google Analytics data.
data = GAData(gaid=gaid, metrics=metrics, start=start, 
                end=end, dimensions=dimensions, segment=segment, 
                token=token)
data = to_df(data)
Analyzing the data
Let’s start by just getting a look at the data. We’ll use the .head() method of DataFrames to take a look at the first few rows. Think of this as glancing at only the top few rows of an Excel spreadsheet.
data.head(5)
This displays the first five rows of the data frame.
Most of the data is not in the right format for proper analysis, so let’s perform some data transformations.
First, let’s convert the date to a datetime object and the metrics to numeric values.
data[‘ga:date’] = pd.to_datetime(data[‘ga:date’])
data[‘ga:users’] = pd.to_numeric(data[‘ga:users’])
data[‘ga:newUsers’] = pd.to_numeric(data[‘ga:newUsers’])
Next, we will need the landing page URL, which are relative and include URL parameters in two additional formats: 1) as absolute urls, and 2) as relative paths (without the URL parameters).
from urllib.parse import urlparse, urljoin
data[‘path’] = data[‘ga:landingPagePath’].apply(lambda x: urlparse(x).path)
data[‘url’] = urljoin(base_site_url, data[‘path’])
Now the fun part begins.
The goal of our analysis is to see which pages lost traffic after a particular date–compared to the period before that date–and which gained traffic after that date.
The example date chosen below corresponds to the exact midpoint of our start and end variables used above to gather the data, so that the data both before and after the date is similarly sized.
We begin the analysis by grouping each URL together by their path and adding up the newUsers for each URL. We do this with the built-in pandas method: .groupby(), which takes a column name as an input and groups together each unique value in that column.
The .sum() method then takes the sum of every other column in the data frame within each group.
For more information on these methods please see the Pandas documentation for groupby.
For those who might be familiar with SQL, this is analogous to a GROUP BY clause with a SUM in the select clause
# Change this depending on your needs
MIDPOINT_DATE = “2017-12-15”
before = data[data[‘ga:date’] < pd.to_datetime(MIDPOINT_DATE)]
after = data[data[‘ga:date’] >= pd.to_datetime(MIDPOINT_DATE)]
# Traffic totals before Shopify switch
totals_before = before[[“ga:landingPagePath”, “ga:newUsers”]]\
                .groupby(“ga:landingPagePath”).sum()
totals_before = totals_before.reset_index()\
                .sort_values(“ga:newUsers”, ascending=False)
# Traffic totals after Shopify switch
totals_after = after[[“ga:landingPagePath”, “ga:newUsers”]]\
               .groupby(“ga:landingPagePath”).sum()
totals_after = totals_after.reset_index()\
               .sort_values(“ga:newUsers”, ascending=False)
You can check the totals before and after with this code and double check with the Google Analytics numbers.
print(“Traffic Totals Before: “)
print(“Row count: “, len(totals_before))
print(“Traffic Totals After: “)
print(“Row count: “, len(totals_after))
Next up we merge the two data frames, so that we have a single column corresponding to the URL, and two columns corresponding to the totals before and after the date.
We have different options when merging as illustrated above. Here, we use an “outer” merge, because even if a URL didn’t show up in the “before” period, we still want it to be a part of this merged dataframe. We’ll fill in the blanks with zeros after the merge.
# Comparing pages from before and after the switch
change = totals_after.merge(totals_before, 
                            left_on=”ga:landingPagePath”, 
                            right_on=”ga:landingPagePath”, 
                            suffixes=[“_after”, “_before”], 
                            how=”outer”)
change.fillna(0, inplace=True)
Difference and percentage change
Pandas dataframes make simple calculations on whole columns easy. We can take the difference of two columns and divide two columns and it will perform that operation on every row for us. We will take the difference of the two totals columns, and divide by the “before” column to get the percent change before and after out midpoint date.
Using this percent_change column we can then filter our dataframe to get the winners, the losers and those URLs with no change.
change[‘difference’] = change[‘ga:newUsers_after’] – change[‘ga:newUsers_before’]
change[‘percent_change’] = change[‘difference’] / change[‘ga:newUsers_before’]
winners = change[change[‘percent_change’] > 0]
losers = change[change[‘percent_change’] < 0]
no_change = change[change[‘percent_change’] == 0]
Sanity check
Finally, we do a quick sanity check to make sure that all the traffic from the original data frame is still accounted for after all of our analysis. To do this, we simply take the sum of all traffic for both the original data frame and the two columns of our change dataframe.
# Checking that the total traffic adds up
data[‘ga:newUsers’].sum() == change[[‘ga:newUsers_after’, ‘ga:newUsers_before’]].sum().sum()
It should be True.
Results
Sorting by the difference in our losers data frame, and taking the .head(10), we can see the top 10 losers in our analysis. In other words, these pages lost the most total traffic between the two periods before and after the midpoint date.
losers.sort_values(“difference”).head(10)
You can do the same to review the winners and try to learn from them.
winners.sort_values(“difference”, ascending=False).head(10)
You can export the losing pages to a CSV or Excel using this.
losers.to_csv(“./losing-pages.csv”)
This seems like a lot of work to analyze just one site–and it is!
The magic happens when you reuse this code on new clients and simply need to replace the placeholder variables at the top of the script.
In part two, we will make the output more useful by grouping the losing (and winning) pages by their types to get the chart I included above.
The post Using Python to recover SEO site traffic (Part one) appeared first on Search Engine Watch.
from Digtal Marketing News https://searchenginewatch.com/2019/02/06/using-python-to-recover-seo-site-traffic-part-one/
0 notes
oscarkruegerus · 6 years ago
Text
Using Python to recover SEO site traffic (Part one)
Helping a client recover from a bad redesign or site migration is probably one of the most critical jobs you can face as an SEO.
The traditional approach of conducting a full forensic SEO audit works well most of the time, but what if there was a way to speed things up? You could potentially save your client a lot of money in opportunity cost.
Last November, I spoke at TechSEO Boost and presented a technique my team and I regularly use to analyze traffic drops. It allows us to pinpoint this painful problem quickly and with surgical precision. As far as I know, there are no tools that currently implement this technique. I coded this solution using Python.
This is the first part of a three-part series. In part two, we will manually group the pages using regular expressions and in part three we will group them automatically using machine learning techniques. Let’s walk over part one and have some fun!
Winners vs losers
Last June we signed up a client that moved from Ecommerce V3 to Shopify and the SEO traffic took a big hit. The owner set up 301 redirects between the old and new sites but made a number of unwise changes like merging a large number of categories and rewriting titles during the move.
When traffic drops, some parts of the site underperform while others don’t. I like to isolate them in order to 1) focus all efforts on the underperforming parts, and 2) learn from the parts that are doing well.
I call this analysis the “Winners vs Losers” analysis. Here, winners are the parts that do well, and losers the ones that do badly.
A visualization of the analysis looks like the chart above. I was able to narrow down the issue to the category pages (Collection pages) and found that the main issue was caused by the site owner merging and eliminating too many categories during the move.
Let’s walk over the steps to put this kind of analysis together in Python.
You can reference my carefully documented Google Colab notebook here.
Getting the data
We want to programmatically compare two separate time frames in Google Analytics (before and after the traffic drop), and we’re going to use the Google Analytics API to do it.
Google Analytics Query Explorer provides the simplest approach to do this in Python.
Head on over to the Google Analytics Query Explorer
Click on the button at the top that says “Click here to Authorize” and follow the steps provided.
Use the dropdown menu to select the website you want to get data from.
Fill in the “metrics” parameter with “ga:newUsers” in order to track new visits.
Complete the “dimensions” parameter with “ga:landingPagePath” in order to get the page URLs.
Fill in the “segment” parameter with “gaid::-5” in order to track organic search visits.
Hit “Run Query” and let it run
Scroll down to the bottom of the page and look for the text box that says “API Query URI.”
Check the box underneath it that says “Include current access_token in the Query URI (will expire in ~60 minutes).”
At the end of the URL in the text box you should now see access_token=string-of-text-here. You will use this string of text in the code snippet below as  the variable called token (make sure to paste it inside the quotes)
Now, scroll back up to where we built the query, and look for the parameter that was filled in for you called “ids.” You will use this in the code snippet below as the variable called “gaid.” Again, it should go inside the quotes.
Run the cell once you’ve filled in the gaid and token variables to instantiate them, and we’re good to go!
First, let’s define placeholder variables to pass to the API
metrics = “,”.join([“ga:users”,”ga:newUsers”])
dimensions = “,”.join([“ga:landingPagePath”, “ga:date”])
segment = “gaid::-5”
# Required, please fill in with your own GA information example: ga:23322342
gaid = “ga:23322342”
# Example: string-of-text-here from step 8.2
token = “”
# Example https://www.example.com or http://example.org
base_site_url = “”
# You can change the start and end dates as you like
start = “2017-06-01”
end = “2018-06-30”
The first function combines the placeholder variables we filled in above with an API URL to get Google Analytics data. We make additional API requests and merge them in case the results exceed the 10,000 limit.
def GAData(gaid, start, end, metrics, dimensions, 
           segment, token, max_results=10000):
  “””Creates a generator that yields GA API data 
     in chunks of size `max_results`”””
  #build uri w/ params
  api_uri = “https://www.googleapis.com/analytics/v3/data/ga?ids={gaid}&”\
             “start-date={start}&end-date={end}&metrics={metrics}&”\
             “dimensions={dimensions}&segment={segment}&access_token={token}&”\
             “max-results={max_results}”
  # insert uri params
  api_uri = api_uri.format(
      gaid=gaid,
      start=start,
      end=end,
      metrics=metrics,
      dimensions=dimensions,
      segment=segment,
      token=token,
      max_results=max_results
  )
  # Using yield to make a generator in an
  # attempt to be memory efficient, since data is downloaded in chunks
  r = requests.get(api_uri)
  data = r.json()
  yield data
  if data.get(“nextLink”, None):
    while data.get(“nextLink”):
      new_uri = data.get(“nextLink”)
      new_uri += “&access_token={token}”.format(token=token)
      r = requests.get(new_uri)
      data = r.json()
      yield data
In the second function, we load the Google Analytics Query Explorer API response into a pandas DataFrame to simplify our analysis.
import pandas as pd
def to_df(gadata):
  “””Takes in a generator from GAData() 
     creates a dataframe from the rows”””
  df = None
  for data in gadata:
    if df is None:
      df = pd.DataFrame(
          data[‘rows’], 
          columns=[x[‘name’] for x in data[‘columnHeaders’]]
      )
    else:
      newdf = pd.DataFrame(
          data[‘rows’], 
          columns=[x[‘name’] for x in data[‘columnHeaders’]]
      )
      df = df.append(newdf)
    print(“Gathered {} rows”.format(len(df)))
  return df
Now, we can call the functions to load the Google Analytics data.
data = GAData(gaid=gaid, metrics=metrics, start=start, 
                end=end, dimensions=dimensions, segment=segment, 
                token=token)
data = to_df(data)
Analyzing the data
Let’s start by just getting a look at the data. We’ll use the .head() method of DataFrames to take a look at the first few rows. Think of this as glancing at only the top few rows of an Excel spreadsheet.
data.head(5)
This displays the first five rows of the data frame.
Most of the data is not in the right format for proper analysis, so let’s perform some data transformations.
First, let’s convert the date to a datetime object and the metrics to numeric values.
data[‘ga:date’] = pd.to_datetime(data[‘ga:date’])
data[‘ga:users’] = pd.to_numeric(data[‘ga:users’])
data[‘ga:newUsers’] = pd.to_numeric(data[‘ga:newUsers’])
Next, we will need the landing page URL, which are relative and include URL parameters in two additional formats: 1) as absolute urls, and 2) as relative paths (without the URL parameters).
from urllib.parse import urlparse, urljoin
data[‘path’] = data[‘ga:landingPagePath’].apply(lambda x: urlparse(x).path)
data[‘url’] = urljoin(base_site_url, data[‘path’])
Now the fun part begins.
The goal of our analysis is to see which pages lost traffic after a particular date–compared to the period before that date–and which gained traffic after that date.
The example date chosen below corresponds to the exact midpoint of our start and end variables used above to gather the data, so that the data both before and after the date is similarly sized.
We begin the analysis by grouping each URL together by their path and adding up the newUsers for each URL. We do this with the built-in pandas method: .groupby(), which takes a column name as an input and groups together each unique value in that column.
The .sum() method then takes the sum of every other column in the data frame within each group.
For more information on these methods please see the Pandas documentation for groupby.
For those who might be familiar with SQL, this is analogous to a GROUP BY clause with a SUM in the select clause
# Change this depending on your needs
MIDPOINT_DATE = "2017-12-15"
before = data[data['ga:date'] < pd.to_datetime(MIDPOINT_DATE)]
after = data[data['ga:date'] >= pd.to_datetime(MIDPOINT_DATE)]
# Traffic totals before Shopify switch
totals_before = before[["ga:landingPagePath", "ga:newUsers"]]\
                .groupby("ga:landingPagePath").sum()
totals_before = totals_before.reset_index()\
                .sort_values("ga:newUsers", ascending=False)
# Traffic totals after Shopify switch
totals_after = after[["ga:landingPagePath", "ga:newUsers"]]\
               .groupby("ga:landingPagePath").sum()
totals_after = totals_after.reset_index()\
               .sort_values("ga:newUsers", ascending=False)
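As an optional aside (not in the original article), you could run the same aggregation on the cleaned path column created earlier, which collapses URL-parameter variants of the same page into a single row; totals_before_by_path below is just an illustrative name:
# Optional: aggregate on the cleaned 'path' column so URL-parameter variants
# of the same page are counted together.
totals_before_by_path = before[['path', 'ga:newUsers']]\
                        .groupby('path').sum()\
                        .reset_index()\
                        .sort_values('ga:newUsers', ascending=False)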
You can check the row counts and totals before and after with this code and double-check them against the Google Analytics numbers.
print("Traffic Totals Before: ")
print("Row count: ", len(totals_before))
print("New users: ", totals_before['ga:newUsers'].sum())
print("Traffic Totals After: ")
print("Row count: ", len(totals_after))
print("New users: ", totals_after['ga:newUsers'].sum())
Next up we merge the two data frames, so that we have a single column corresponding to the URL, and two columns corresponding to the totals before and after the date.
We have different options when merging (inner, left, right, or outer). Here, we use an "outer" merge, because even if a URL didn't show up in the "before" period, we still want it to be part of this merged dataframe. We'll fill in the blanks with zeros after the merge. The short example below illustrates the difference.
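Here is a minimal, self-contained sketch, with made-up URLs and numbers purely for illustration, of why the outer merge matters:
# Toy example (not part of the analysis): '/a' only has traffic after the
# switch and '/c' only before; an outer merge keeps both rows, while an
# inner merge would silently drop them.
after_demo = pd.DataFrame({'url': ['/a', '/b'], 'newUsers_after': [10, 5]})
before_demo = pd.DataFrame({'url': ['/b', '/c'], 'newUsers_before': [7, 3]})
print(after_demo.merge(before_demo, on='url', how='outer'))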
# Comparing pages from before and after the switch
change = totals_after.merge(totals_before,
                            left_on="ga:landingPagePath",
                            right_on="ga:landingPagePath",
                            suffixes=["_after", "_before"],
                            how="outer")
change.fillna(0, inplace=True)
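A quick optional check, not in the original article, that the outer merge produced exactly one row per unique landing page:
# Every landing page from either period should appear exactly once after the merge
assert len(change) == change['ga:landingPagePath'].nunique()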
Difference and percentage change
Pandas dataframes make simple calculations on whole columns easy. We can take the difference of two columns, or divide one column by another, and pandas performs the operation on every row for us. We take the difference of the two totals columns and divide by the "before" column to get the percent change before and after our midpoint date.
Using this percent_change column we can then filter our dataframe to get the winners, the losers and those URLs with no change.
change['difference'] = change['ga:newUsers_after'] - change['ga:newUsers_before']
change['percent_change'] = change['difference'] / change['ga:newUsers_before']
winners = change[change['percent_change'] > 0]
losers = change[change['percent_change'] < 0]
no_change = change[change['percent_change'] == 0]
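One thing to watch: pages with zero traffic in the "before" period divide by zero, so pandas gives them an infinite percent_change. They still land in winners, which is usually what you want, but if you'd like to look at those brand-new pages separately, here is one way to do it (new_pages is just an illustrative name, and numpy is assumed to be available):
import numpy as np
# Pages with zero 'before' traffic get an infinite percent_change from the
# division above; pull them out separately if that is useful to you.
new_pages = change[np.isinf(change['percent_change'])]
print(len(new_pages), "pages had no 'before' traffic in this dataset")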
Sanity check
Finally, we do a quick sanity check to make sure that all the traffic from the original data frame is still accounted for after all of our analysis. To do this, we simply take the sum of all traffic for both the original data frame and the two columns of our change dataframe.
# Checking that the total traffic adds up
data['ga:newUsers'].sum() == change[['ga:newUsers_after', 'ga:newUsers_before']].sum().sum()
It should be True.
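If it comes back False, a quick optional way to see how far apart the two sides are:
# Print both sides of the comparison to spot any missing traffic
print("Original total:", data['ga:newUsers'].sum())
print("Merged total:", change[['ga:newUsers_after', 'ga:newUsers_before']].sum().sum())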
Results
Sorting by the difference in our losers data frame, and taking the .head(10), we can see the top 10 losers in our analysis. In other words, these pages lost the most total traffic between the two periods before and after the midpoint date.
losers.sort_values("difference").head(10)
You can do the same to review the winners and try to learn from them.
winners.sort_values("difference", ascending=False).head(10)
You can export the losing pages to a CSV file (or an Excel workbook, shown below) using this.
losers.to_csv("./losing-pages.csv")
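If you'd rather hand over an Excel workbook, pandas can write one directly; this assumes an Excel writer backend such as openpyxl is installed in your environment:
# Requires an Excel engine, e.g. pip install openpyxl
losers.to_excel("./losing-pages.xlsx", index=False)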
This seems like a lot of work to analyze just one site, and it is!
The magic happens when you reuse this code on new clients and simply need to replace the placeholder variables at the top of the script.
In part two, we will make the output more useful by grouping the losing (and winning) pages by their types to get the chart I included above.
The post Using Python to recover SEO site traffic (Part one) appeared first on Search Engine Watch.
from Digital Marketing News https://searchenginewatch.com/2019/02/06/using-python-to-recover-seo-site-traffic-part-one/
0 notes