#SQL datetime analytics
Explore tagged Tumblr posts
thedbahub · 1 year ago
Text
Grouping Data by Time Intervals in SQL Server: Hourly and 10-Minute Aggregations
In SQL Server, grouping data by time intervals such as by hour or by 10 minutes requires manipulation of the date and time values so that rows falling within each interval are grouped together. This can be achieved using the DATEPART function for hourly grouping or a combination of DATEPART and arithmetic operations for more granular groupings like every 10 minutes. Here’s how you can do…
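The post is truncated here, but a minimal sketch of the technique it describes might look like the following. The dbo.Events table and EventTime column are illustrative, not from the original post.
-- Hourly totals: group on the date plus the hour extracted with DATEPART.
SELECT
    CAST(EventTime AS date)        AS EventDate,
    DATEPART(HOUR, EventTime)      AS EventHour,
    COUNT(*)                       AS RowsInInterval
FROM dbo.Events
GROUP BY CAST(EventTime AS date), DATEPART(HOUR, EventTime)
ORDER BY EventDate, EventHour;

-- 10-minute buckets: combine DATEPART with integer arithmetic so that
-- minutes 0-9, 10-19, and so on land in the same group.
SELECT
    CAST(EventTime AS date)                  AS EventDate,
    DATEPART(HOUR, EventTime)                AS EventHour,
    (DATEPART(MINUTE, EventTime) / 10) * 10  AS TenMinuteStart,
    COUNT(*)                                 AS RowsInInterval
FROM dbo.Events
GROUP BY CAST(EventTime AS date),
         DATEPART(HOUR, EventTime),
         (DATEPART(MINUTE, EventTime) / 10) * 10;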
View On WordPress
0 notes
govindhtech · 1 year ago
Text
How BigQuery Data Canvas Makes AI-Powered Insights Easy
A Gemini feature in BigQuery, the BigQuery Studio data canvas provides a graphical interface for analysis processes and natural language prompts for finding, transforming, querying, and visualising data.
A directed acyclic graph (DAG) is used by BigQuery data canvas for analysis workflows, giving you a graphical representation of your workflow. Working with many branches of inquiry in one location and iterating on query results are both possible with BigQuery data canvas.
BigQuery data canvas
The BigQuery data canvas is intended to support you on your path from data to insights. Working with data doesn’t require technical expertise of particular products or technologies. Using natural language, BigQuery data canvas and Dataplex metadata combine to find relevant tables.
Gemini in BigQuery is used by BigQuery data canvas to locate your data, build charts, create SQL, and create data summaries.
Capabilities
BigQuery data canvas lets you do the following:
Use keyword search syntax along with Dataplex metadata to find assets such as tables, views, or materialized views.
Use natural language for basic SQL queries such as the following:
Queries that contain FROM clauses, math functions, arrays, and structs.
JOIN operations for two tables.
Visualize data by using the following graphic types:
Bar chart
Heat map
Line graph
Pie chart
Scatter chart
Create custom visualizations by using natural language to describe what you want.
Automate data insights.
Limitations
Natural language commands might not work well with the following:
BigQuery ML
Apache Spark
Object tables
BigLake
INFORMATION_SCHEMA views
JSON
Nested and repeated fields
Complex functions and data types such as DATETIME and TIMEZONE
Data visualizations don’t work with geomap charts.
BigQuery data canvas, a Gemini in BigQuery feature, is a ground-breaking data analytics tool that streamlines the whole data analysis process, from data preparation and discovery to analysis, visualisation, and collaboration, all in one location within BigQuery. Because it uses natural language processing, you can ask questions about your data in plain English or in a variety of other languages.
Because this easy approach removes the need to hand-write sophisticated SQL queries, data analysis becomes accessible to both technical and non-technical people. You can examine, modify, and display your BigQuery data with data canvas without ever leaving the environment in which it is stored.
This blog post provides an overview of BigQuery data canvas along with a technical walkthrough of a real-world scenario that uses the public github_repos dataset. The dataset contains over 3TB of activity from 3M+ open-source repositories. We'll look at how to answer questions like:
How many commits were made to a particular repository in a given year?
Who authored the most repositories in a particular year?
How many non-authored commits were applied over time?
Which users contributed to a certain file, and when?
You'll see how data canvas handles intricate SQL operations from your natural language prompts, such as joining tables, extracting particular data items, unnesting fields, and converting timestamps. We'll even show you how to create intelligent summaries and visualisations with a single click.
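As a concrete illustration, a prompt like "how many commits did a given repository receive per month last year?" might be turned into SQL along these lines. This is a hand-written sketch rather than captured data canvas output; it assumes the commits table exposes a repeated repo_name field and a committer.date timestamp, so verify both against the actual github_repos schema and swap in your own repository name and year.
SELECT
  FORMAT_TIMESTAMP('%Y-%m', committer.date) AS commit_month,
  COUNT(*) AS commit_count
FROM `bigquery-public-data.github_repos.commits`,
  UNNEST(repo_name) AS repo
WHERE repo = 'GoogleCloudPlatform/python-docs-samples'  -- illustrative repository
  AND EXTRACT(YEAR FROM committer.date) = 2023           -- illustrative year
GROUP BY commit_month
ORDER BY commit_month;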
A quick overview of BigQuery data canvas
BigQuery data canvas is mostly used for three types of tasks: finding data, generating SQL, and generating insights. Image credit: Google Cloud
Find Data
To locate data in BigQuery using a rapid keyword search or a natural language text prompt, use data canvas.
Generate SQL
Additionally, you may use the BigQuery data canvas to have SQL code written for you using natural language prompts powered by Gemini.
Create Insights
Finally, use a single click to uncover insights concealed within your data! Gemini creates visualisations for you automatically so you can see the story your data is telling.
Using the BigQuery data canvas
Let’s look at an example to help you better understand the potential impact that the BigQuery data canvas can have in your company. Businesses of all kinds, from big corporations to tiny startups, can gain from having a better grasp of the productivity of their development staff. Google Cloud will demonstrate in this in-depth technical tutorial how to leverage data canvas and the public dataset github_repos to provide insightful results in a shared workspace.
You’ll learn how data canvas simplifies the creation of sophisticated SQL queries by working through this example, which demonstrates how to create joins and unnested columns, convert timestamps, extract the month and year from date fields, and more. Gemini’s features make it simple to create these queries and use natural language to examine your data with illuminating visualisations.
Please be aware that, as with many of the new AI products and services available today, using any LLM-enabled application successfully requires strong prompt-engineering skills. Many people believe that large language models (LLMs) aren't very good at producing SQL out of the box. In our experience, however, Gemini in BigQuery via data canvas can produce sophisticated SQL queries from the context of your data corpus if you use the appropriate prompting techniques. Data canvas uses your natural language queries to decide the ordering, grouping, sorting, record-count limits, and overall SQL structure.
The github_repos dataset, which is 3TB+ in size and available in BigQuery Public Datasets, comprises information in numerous tables about commits, watch counts, and other activity on 3M+ open-source projects. We want to look at the Google Cloud Platform repository for this example. As always, before you begin, make sure you have the necessary IAM permissions. In addition, make sure you have the necessary rights to access the datasets and data canvas in order to run nodes properly.
Using data canvas makes it simple to explore every table in the github_repos dataset. Here, you can evaluate the schema and details and preview the data in one panel while comparing datasets side by side. Image credit: Google Cloud
After choosing your dataset, you can hover over the bottom of the node to branch it to query or join it with another table. The dataset for the following transformation node is shown by arrows. For clarity, you can give each node a name when sharing the canvas. You can delete, debug, duplicate, or run all of the nodes in a series using the options in the upper right corner. Results can be downloaded, and data can be exported to Looker Studio or Sheets. In the navigation panel, you can also inspect the DAG structure, restore previous versions, and rate SQL suggestions.
While examining the github_repos dataset, Google Cloud looks at four main facets of the data, attempting to ascertain the following (a hand-written sketch of the kind of SQL the third question implies appears after this list):
1) The total number of commits made in a single year
2) The number of authored repositories for a specific year
3) The total number of non-authored commits applied over time
4) How many user commits there have been for a specific file at a specific time
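For example, the third question, non-authored commits over time, could reduce to SQL roughly like this. Field names such as author.email, committer.email, committer.date, and the repeated repo_name are assumptions to check against the public schema; data canvas would generate its own variant from the prompt.
SELECT
  EXTRACT(YEAR FROM committer.date) AS commit_year,
  COUNT(*) AS non_authored_commits
FROM `bigquery-public-data.github_repos.commits`,
  UNNEST(repo_name) AS repo
WHERE repo LIKE 'GoogleCloudPlatform/%'
  AND author.email != committer.email
GROUP BY commit_year
ORDER BY commit_year;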
Utilise BigQuery data canvas to simplify data analysis
It might be challenging to interpret data for a new project or use case when working with large datasets that span multiple disciplines. This procedure can be streamlined by using data canvas. Data canvas helps you work more efficiently and quickly by streamlining data analysis using natural language-based SQL creation and visualisations. It also reduces the need for repetitive queries and lets you plan automatic data refreshes.
Read more on Govindhtech.com
0 notes
suiteviz-blog · 6 years ago
Text
Which SQL Functions Are Available in NetSuite Saved Searches?
Numeric Functions
Examples
ABS( {amount} )
ACOS( 0.35 )
ASIN( 1 )
ATAN( 0.2 )
ATAN2( 0.2, 0.3 )
BITAND( 5, 3 )
CEIL( {today}-{createddate} )
COS( 0.35 )
COSH( -3.15 )
EXP( {rate} )
FLOOR( {today}-{createddate} )
LN( 20 )
LOG( 10, 20 )
MOD( {today}-{lastmessagedate},7 )
NANVL( {itemisbn13}, '' )
POWER( {custcoldaystoship},-.196 )
REMAINDER( {transaction.totalamount}, {transaction.amountpaid} )
ROUND( ( {today}-{startdate} ), 0 )
SIGN( {quantity} )
SIN( 5.2 )
SINH( 3 )
SQRT( POWER( {taxamount}, 2 ) )
TAN( -5.2 )
TANH( 3 )
TRUNC( {amount}, 1 )
Character Functions Returning Character Values
Examples
CHR( 13 )
CONCAT( {number}, CONCAT( '_', {line} ) )
INITCAP( {customer.companyname} )
LOWER( {customer.companyname} )
LPAD( {line},3,'0' )
LTRIM( {companyname},'-' )
REGEXP_REPLACE( {name}, '^.*:', '' )
REGEXP_SUBSTR( {item}, '[^:]+$' )
REPLACE( {serialnumber}, '&', ',' )
RPAD( {firstname},20 )
RTRIM( {paidtransaction.externalid}, '-Invoice' )
SOUNDEX( {companyname} )
SUBSTR( {transaction.salesrep}, 1, 3 )
TRANSLATE( {expensecategory}, ' ', '+' )
TRIM ( BOTH ',' FROM {custrecord_assetcost} )
UPPER( {unit} )
Character Functions Returning Number Values
Examples
ASCII( {taxitem} )
INSTR( {messages.message}, 'cspdr3' )
LENGTH( {name} )
REGEXP_INSTR ( {item.unitstype}, '\d' )
TO_NUMBER( {quantity} )
Datetime Functions
Examples
ADD_MONTHS( {today},-1 )
LAST_DAY( {today} )
MONTHS_BETWEEN( SYSDATE, {createddate} )
NEXT_DAY( {today},'SATURDAY' )
ROUND( TO_DATE( '12/31/2014', 'mm/dd/yyyy' )-{datecreated} )
TO_CHAR( {date}, 'hh24' )
TO_DATE( '31.12.2011', 'DD.MM.YYYY' )
TRUNC( {today},'YYYY' )
Also see Sysdate in one of the example sections below.
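These datetime functions also combine into single saved-search formulas. For instance (the {trandate} field is just an example; substitute your own date field):
TO_CHAR( {trandate}, 'YYYY-MM' )
TRUNC( {trandate}, 'MM' )
ADD_MONTHS( TRUNC( {today}, 'MM' ), -1 )
The first, used as a grouped formula (text) column, buckets results by calendar month; the second returns the first day of the transaction's month; the third returns the first day of the previous month, which is handy in date comparisons.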
NULL-Related Functions
Examples
COALESCE( {quantitycommitted}, 0 )
NULLIF( {price}, 0 )
NVL( {quantity},'0' )
NVL2( {location}, 1, 2 )
Decode
Examples
DECODE( {systemnotes.name}, {assigned},'T','F' )
Sysdate
Examples
TO_DATE( SYSDATE, 'DD.MM.YYYY' )
or
TO_CHAR( SYSDATE, 'mm/dd/yyyy' )
See also TO_DATE and TO_CHAR in the Datetime Functions.
Case
Examples
CASE {state}
WHEN 'NY' THEN 'New York'
WHEN 'CA' THEN 'California'
ELSE {state}
END
or
CASE
WHEN {quantityavailable} > 19 THEN 'In Stock'
WHEN {quantityavailable} > 1 THEN 'Limited Availability'
WHEN {quantityavailable} = 1 THEN 'The Last Piece'
WHEN {quantityavailable} IS NULL THEN 'Discontinued'
ELSE 'Out of Stock'
END
Analytic and Aggregate Functions
Examples
DENSE_RANK ( {amount} WITHIN GROUP ( ORDER BY {AMOUNT} ) )
or
DENSE_RANK(  ) OVER ( PARTITION BY {name} ORDER BY {trandate} DESC )
KEEP( DENSE_RANK LAST ORDER BY {internalid} )
RANK(  ) OVER ( PARTITION by {tranid} ORDER BY {line} DESC )
or
RANK ( {amount} WITHIN GROUP ( ORDER BY {amount} ) )
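The analytic functions can also be combined with the datetime functions above, for example to rank transactions by amount within each calendar month. This is a sketch: substitute your own fields, and note that whether an expression is accepted inside PARTITION BY can depend on the search context.
RANK(  ) OVER ( PARTITION BY TO_CHAR( {trandate}, 'YYYY-MM' ) ORDER BY {amount} DESC )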
1 note · View note
awesomehenny888 · 4 years ago
Text
5 Best SQL Certifications to Boost Your Career in 2021
If you want to work in data roles such as data scientist, database administrator, or big data architect, Structured Query Language (SQL) is one of the programming languages you must master. However, if you want to be recruited quickly by a large company, an SQL certification is a must-have. There are many SQL certifications you can earn, so which ones? Check out the rundown below.
The 5 Best SQL Certifications
1. The Ultimate MySQL Bootcamp: Udemy
This Udemy course provides a great deal of practice to improve your skills, starting with the basics of MySQL and continuing on to several other concepts. The course includes plenty of exercises, and you can take it at whatever pace you like.
Course curriculum
SQL overview and installation: SQL vs. MySQL, installation on Windows and Mac
Creating databases and tables: creating and dropping tables, basic data types
Inserting data, NULL, NOT NULL, primary keys, table constraints
CRUD commands: SELECT, UPDATE, DELETE, challenge exercises
String functions: concat, substring, replace, reverse, char length, upper and lower
Using different wildcard characters, order by, limit, like, wildcards
Aggregate functions: count, group by, min, max, sum, avg (a sample query of this kind appears right after this list)
Data types in detail: char, varchar, decimal, float, double, date, time, datetime, now, curdate, curtime, timestamp
Logical operators: not equal, not like, greater than, less than, AND, OR, between, not in, in, case statements
One to many: joins, foreign keys, cross join, inner join, left join, right join; many to many
Instagram data clone: Instagram clone schema, users schema, likes, comments, photos, hashtags, complete schema
Working with big data: JUMBO dataset, exercises
Introducing Node: crash course on Node.js, npm, MySQL, and other languages
Building a web application: setting up, connecting Express and MySQL, adding EJS templates, connecting the form
Database triggers: writing triggers, preventing Instagram self-follows with triggers, creating logger triggers, managing triggers, and a warning
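As a taste of what the aggregate-function and date/time lessons above build towards, a typical exercise query might look like the one below. The orders table and its columns are invented for illustration, not taken from the course.
-- Orders and revenue per month for the past year (illustrative schema).
SELECT
    DATE_FORMAT(order_date, '%Y-%m') AS order_month,
    COUNT(*)                         AS order_count,
    SUM(total)                       AS revenue
FROM orders
WHERE order_date >= CURDATE() - INTERVAL 1 YEAR
GROUP BY order_month
ORDER BY order_month;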
2. Learn SQL Basics for Data Science Specialization
This training aims to apply, in a practical way, all the SQL concepts used in data science. The first course of the specialization is a foundation course that teaches you all the SQL knowledge you will later need for the other courses. The specialization consists of four courses:
SQL for Data Science.
Data Wrangling, Analysis, and AB Testing with SQL.
Distributed Computing with Spark SQL.
SQL for Data Science Capstone Project.
Course curriculum
1. SQL for Data Science (14 hours)
Introduction, selecting, and fetching data using SQL.
Filtering, Sorting, and Calculating Data with SQL.
Subqueries and Joins in SQL.
Modifying and Analyzing Data with SQL.
2. Data Wrangling, Analysis, and AB Testing with SQL
Data of Unknown Quality.
Creating Clean Datasets.
SQL Problem Solving.
Case Study: AB Testing.
3. Distributed Computing with Spark SQL
Introduction to Spark.
Spark Core Concepts.
Engineering Data Pipelines.
Machine Learning Applications of Spark.
4. SQL for Data Science Capstone Project
Project Proposal and Data Selection/Preparation.
Descriptive Stats & Understanding Your Data.
Beyond Descriptive Stats (Dive Deeper/Go Broader).
Presenting Your Findings (Storytelling).
3. Excel to MySQL: Analytic Techniques for Business Specialization
This training is a Coursera specialization that approaches SQL from a business point of view. If you want to go deeper into data science or a related field, this training is excellent. Along with SQL, you will also pick up skills such as Microsoft Excel, business analysis, data science tools and algorithms, and more about business processes. There are five courses in this specialization:
Business Metrics for Data-Driven Companies.
Mastering Data Analysis in Excel.
Data Visualization and Communication with Tableau.
Managing Big Data with MySQL.
Increasing Real Estate Management Profits: Harnessing Data Analytics.
Course curriculum
Business Metrics for Data-Driven Companies (8 hours): introduction to business metrics, the business analytics market, applying business metrics to a business case study.
Mastering Data Analysis in Excel (21 hours): Excel essentials, binary classification, information measures, linear regression, model building.
Data Visualization and Communication with Tableau (25 hours): Tableau, visualization, logic, projects.
Managing Big Data with MySQL (41 hours): relational databases, queries on a single table, grouping data, handling complex data through queries.
Increasing Real Estate Management Profits: Harnessing Data Analytics (23 hours): data extraction and visualization, modeling, cash flow and profit, data dashboards.
4. MySQL for Data Analytics and BI
This training covers MySQL in depth, starting from the basics and moving on to advanced SQL topics. It also includes plenty of exercises to round out your knowledge.
Course curriculum
Introduction to databases, SQL, and MySQL.
SQL theory: SQL as a declarative language, DDL, keywords, DML, DCL, TCL.
Basic terminologies: Relational database, primary key, foreign key, unique key, null values.
Installing MySQL: client-server model, setting up a connection, MySQL interface.
First steps in SQL: SQL files, creating a database, introduction to data types, fixed and floating data types, table creating, using the database, and tables.
MySQL constraints: Primary key constraints, Foreign key constraints, Unique key constraint, NOT NULL
SQL Best practices.
SQL Select, Insert, Update, Delete, Aggregate functions, joins, subqueries, views, Stored routines.
Advanced SQL Topics: Types of MySQL variables, session, and global variables, triggers, user-defined system variables, the CASE statement.
Combining SQL and Tableau.
5. Learning SQL Programming
This training is well suited to beginners and covers all the essential aspects of SQL. It also includes many exercise files that can improve your skills.
Course curriculum
Selecting data from a database.
Understanding JOIN types.
Data types, math, and helpful functions: compound selects, transforming data, using aliases to shorten field names.
Adding or changing data.
Troubleshooting common SQL errors.
Those are the various SQL certifications you can pursue to raise your skills and get hired quickly by a large company. Of course, experience and technical knowledge matter, but an SQL certification becomes the deciding factor when candidates with similar profiles have to be screened. Read also:
3 Benefits of Attending SQL Server Training in Jakarta
0 notes
erossiniuk · 5 years ago
Text
Application Insights: select and filter
Azure Application Insights has its own language and syntax for selecting and filtering data, different from Structured Query Language (SQL).
In this post, I am going to compare Analytics query language to SQL with examples for selection and filtration.
First, navigate to analytics page of any Application Insights App by clicking Logs tab in the overview page of the app.
Navigate to Analytics page
Then, the analytics tab opens a new editor window where you can type your query.
Analytics Logs Query Editor
Now, in the query editor we are going to write our queries using the Analytics Query Language. The easiest way to understand this language is by referring to a well-known language which is SQL.
Select
First of all, to write a wild card query (i.e. query without filtration), all you need to write is the name of the log type you are searching for. For example, “requests”. This is equivalent in SQL to
SELECT * FROM requests
Select: retrieving all requests
Selecting specific fields
The keyword “project” is used to include specific fields in the query output. Copy this query to the query editor to validate your understanding of this rule.
requests | project resultCode, timestamp
This is equivalent in SQL to
SELECT resultCode, timestamp FROM requests
Selecting specific fields
Select number of records
The equivalent to SQL query
SELECT TOP 10 * FROM requests
in Analytics language is
requests | take 10
Selecting the first 10 records
Filters
Filtering with non-null fields
The equivalent to SQL query
SELECT * FROM requests WHERE resultCode IS NOT NULL
in Analytics language is
requests | where isnotnull(resultCode)
Filtering with non-null fields: not Null Filtration
Filtering by comparing with dates
The equivalent to SQL query
SELECT * FROM requests WHERE timestamp > getdate()-1
in Analytics language is
requests | where timestamp > ago(1d)
The equivalent to SQL query
SELECT * FROM requests WHERE timestamp BETWEEN '2020-07-10' AND '2020-07-11'
in Analytics language is
requests | where timestamp > datetime(2020-07-10) and timestamp <= datetime(2020-07-11)
Filtering by comparing with dates
Filtering by comparing with strings
The equivalent to SQL query
SELECT * FROM requests WHERE itemType = 'request'
in Analytics language is
requests | where itemType == "request"
The equivalent to SQL query
SELECT * FROM requests WHERE itemType LIKE 'request%'
in Analytics language is
requests | where itemType startswith "request"
The equivalent to SQL query
SELECT * FROM requests WHERE itemType LIKE '%request%'
in Analytics language is
requests | where itemType contains "request"
Filtering with regular expressions
Analytics language has a keyword for regular expression comparisons as follows
requests | where itemType matches regex "request"
Filtering by comparing with Boolean
The equivalent to SQL query
SELECT * FROM requests WHERE success = 'False'
in Analytics language is
requests | where success == "False"
Filtering by comparing with Boolean
And this is it for Application Insights select and filter. Do you want more details about union? Follow me!
The post Application Insights: select and filter appeared first on PureSourceCode.
from WordPress https://www.puresourcecode.com/tools/application-insights-select-and-filter/
0 notes
siva3155 · 6 years ago
Text
300+ TOP SAS Interview Questions and Answers
SAS Interview Questions for freshers and experienced:
1. What is SAS? What functions does it perform?
SAS stands for Statistical Analysis System, an integrated set of software products. It covers: information retrieval and data management; writing reports and graphics; statistical analytics, econometrics and data mining; business planning, forecasting, and decision support; operations research and project management; quality improvement; data warehousing; and application development.
2. What is the basic structure of a SAS base program?
The basic structure of SAS consists of the DATA step, which retrieves and manipulates data, and the PROC step, which interprets the data.
3. What is the basic syntax style in SAS?
To run a program successfully, you need the following basic elements: a semi-colon at the end of every statement; a DATA statement that defines your data set; an INPUT statement; at least one space between each word or statement; and a RUN statement. For example: infile 'H:\StatHW\yourfilename.dat';
4. Explain the DATA step in SAS.
The DATA step creates a SAS dataset which carries the data along with a "data dictionary." The data dictionary holds information about the variables and their properties.
5. What is the PDV?
The PDV, or Program Data Vector, is a logical area in memory where SAS builds a data set one observation at a time. An input buffer is created at compile time to hold a record from an external file; the PDV is created after the input buffer.
6. Approximately what date is represented by the SAS date value of 730?
31st December 1961.
7. Identify statements whose placement in the DATA step is critical.
INPUT, DATA and RUN.
8. Does SAS 'translate' (compile) or does it 'interpret'?
Compile.
9. What does the RUN statement do?
When the SAS editor sees RUN, it starts compiling the data or proc step. If you have more than one data step or proc step, or a proc step following the data step, you can avoid using the RUN statement.
10. Why is SAS considered self-documenting?
SAS is considered self-documenting because at compilation time it creates and stores information about the data set, such as the time and date of creation, the number of variables, and labels, inside the dataset, and you can view that information using the PROC CONTENTS procedure.
SAS Interview Questions
11. What are some good SAS programming practices for processing very large data sets?
Sort them once; you can use firstobs= and obs=.
12. What is the difference between functions and PROCs that calculate the same simple descriptive statistics?
Functions can be used inside the data step and on the same data set, but with PROCs you can create new data sets to output the results.
13. If you were told to create many records from one record, show how you would do this using arrays and with PROC TRANSPOSE.
I would use TRANSPOSE if the variables are fewer and arrays if the variables are more numerous; it depends.
14. What is a method for assigning first.VAR and last.VAR to the BY group variable on unsorted data?
In unsorted data you can't use First. or Last.
15. How do you debug and test your SAS program?
The first thing is to look in the log for errors, warnings, or NOTEs in some cases, or use the debugger in the SAS data step.
16. What other SAS features do you use for error trapping and data validation?
Check the log, and for data validation use things like PROC FREQ, PROC MEANS, or sometimes PROC PRINT to see how the data looks.
17. How would you combine 3 or more tables with different structures?
I would sort them by common variables and use a MERGE statement; it depends on what is meant by different structures.
18. What areas of SAS are you most interested in?
BASE, STAT, GRAPH, ETS.
19. Briefly describe 5 ways to do a "table lookup" in SAS.
Match merging, direct access, format tables, arrays, PROC SQL.
20. What versions of SAS have you used (on which platforms)?
SAS 9.1.3, 9.0, and 8.2 on Windows and UNIX; SAS 7 and 6.12.
21. What are some good SAS programming practices for processing very large data sets?
Sampling using the OBS option or subsetting, commenting the lines, and using DATA _NULL_.
22. What are some problems you might encounter in processing missing values? In data steps? Arithmetic? Comparisons? Functions? Classifying data?
The result of any operation with a missing value will be a missing value. Most SAS statistical procedures exclude observations with any missing variable values from an analysis.
23. How would you create a data set with 1 observation and 30 variables from a data set with 30 observations and 1 variable?
Using PROC TRANSPOSE.
24. What is the difference between functions and PROCs that calculate the same simple descriptive statistics?
A PROC can be used with a wider scope and the results can be sent to a different dataset. Functions usually affect the existing dataset.
25. If you were told to create many records from one record, show how you would do this using an array and with PROC TRANSPOSE.
Declare an array for the number of variables in the record and then use a DO loop, or use PROC TRANSPOSE with a VAR statement.
26. What are _numeric_ and _character_ and what do they do?
They either read or write all numeric or all character variables in the dataset.
27. How would you create multiple observations from a single observation?
Using the double trailing @@.
28. For what purpose would you use the RETAIN statement?
The RETAIN statement is used to hold the values of variables across iterations of the data step. Normally, all variables in the data step are set to missing at the start of each iteration of the data step.
What is the order of evaluation of the comparison operators: + - * / ** ()?
(), **, *, /, +, -
29. How could you generate test data with no input data?
Using DATA _NULL_ and a PUT statement.
30. How do you debug and test your SAS programs?
Using OBS=0 and system options to trace the program execution in the log.
31. What can you learn from the SAS log when debugging?
It will display the execution of the whole program and the logic. It will also display errors with line numbers so that you can find and edit the program.
32. What is the purpose of _error_?
It has only two values: 1 for error and 0 for no error.
33. How can you put a "trace" in your program?
By using ODS TRACE ON.
34. How does SAS handle missing values in: assignment statements, functions, a merge, an update, sort order, formats, PROCs?
Missing values will be assigned as missing in an assignment statement. Sort order treats missing as the second smallest value, followed by the underscore.
35. How do you test for missing values?
Using subsetting constructs like IF-THEN/ELSE, WHERE and SELECT.
36. How are numeric and character missing values represented internally?
Character as a blank or '' and numeric as a period (.).
37. Which date function advances a date, time or date/time value by a given interval?
INTNX.
38. In the flow of DATA step processing, what is the first action in a typical DATA step?
When you submit a DATA step, SAS processes the DATA step and then creates a new SAS data set (creation of the input buffer and PDV): the compilation phase, then the execution phase.
39. What are SAS/ACCESS and SAS/CONNECT?
SAS/ACCESS works with databases like Oracle, SQL Server, MS Access, etc. SAS/CONNECT handles server connections.
40. What is the one statement to set the criteria of data that can be coded in any step?
OPTIONS statement, LABEL statement, KEEP/DROP statements.
41. What is the purpose of using the N=PS option?
The N=PS option creates a buffer in memory which is large enough to store PAGESIZE (PS) lines and enables a page to be formatted randomly prior to it being printed.
42. What are the scrubbing procedures in SAS?
PROC SORT with the NODUPKEY option, because it will eliminate duplicate values.
43. What are the new features included in the new version of SAS?
The main advantage of version 9 is faster execution of applications and centralized access to data and support. Many changes were made in version 9 compared with version 8. The following are a few: SAS version 9 supports formats longer than 8 bytes, which is not possible in version 8. The length allowed for a numeric format in version 9 is 32, whereas it is 8 in version 8. The length for character formats in version 9 is 31, whereas in version 8 it is 32. The length for a numeric informat in version 9 is 31, versus 8 in version 8. The length for a character informat is 30, versus 32 in version 8. Three new informats are available in version 9 to convert various date, time and datetime forms of data into a SAS date or SAS time: ANYDTDTEw. converts to a SAS date value; ANYDTTMEw. converts to a SAS time value; ANYDTDTMw. converts to a SAS datetime value. The CALL SYMPUTX macro statement was added in version 9; it creates a macro variable at execution time in the data step by trimming trailing blanks and automatically converting numeric values to character. A new ODS option (the COLUMN option) was included to create multiple columns in the output.
44. What differences did you find among versions 6, 8 and 9 of SAS?
The SAS 9 architecture is fundamentally different from any prior version of SAS. In the SAS 9 architecture, SAS relies on a new component, the Metadata Server, to provide an information layer between the programs and the data they access.
Metadata, such as security permissions for SAS libraries and where the various SAS servers are running, is maintained in a common repository.
45. What has been your most common programming mistake?
Missing a semicolon and not checking the log after submitting a program; not using debugging techniques and not using the FSVIEW option vigorously.
Name several ways to achieve efficiency in your program.
Efficiency and performance strategies can be classified into five different areas: CPU time, data storage, elapsed time, input/output, and memory. CPU time and elapsed time are the baseline measurements.
46. A few examples of efficiency violations:
Retaining unwanted datasets; not subsetting early to eliminate unwanted records.
Efficiency-improving techniques: use KEEP and DROP statements to retain only necessary variables; use macros to reduce the code; use IF-THEN/ELSE statements to process data programmatically; use the SQL procedure to reduce the number of programming steps; use LENGTH statements to reduce variable size and so reduce data storage; use DATA _NULL_ steps for processing null data sets to save data storage.
47. What other SAS products have you used and consider yourself proficient in using?
DATA _NULL_ statement, PROC MEANS, PROC REPORT, PROC TABULATE, PROC FREQ, PROC PRINT, PROC UNIVARIATE, etc.
What is the significance of the 'OF' in X=SUM (OF a1-a4, a6, a9);?
If you don't use the OF keyword it might not be interpreted as expected. For example, the function above would calculate the sum of a1 minus a4 plus a6 and a9, and not the whole sum of a1 to a4 plus a6 and a9. The same is true for the MEAN function.
48. How do you use IF-THEN/ELSE logic in PROC SQL?
PROC SQL;
SELECT WEIGHT,
CASE
WHEN WEIGHT BETWEEN 0 AND 50 THEN 'LOW'
WHEN WEIGHT BETWEEN 51 AND 70 THEN 'MEDIUM'
WHEN WEIGHT BETWEEN 71 AND 100 THEN 'HIGH'
ELSE 'VERY HIGH'
END AS NEWWEIGHT
FROM HEALTH;
QUIT;
49. How do you remove duplicates using PROC SQL?
Proc SQL noprint;
Create Table inter.Merged1 as
Select distinct * from inter.readin;
Quit;
50. How do you count unique values by a grouping variable?
You can use PROC SQL with COUNT(DISTINCT variable_name) to determine the number of unique values for a column.
51. What is the one statement to set the criteria of data that can be coded in any step?
Options statement.
52. What is the effect of the OPTIONS statement ERRORS=1?
The _ERROR_ variable has a value of 1 if there is an error in the data for that observation and 0 if there is not.
53. What do the SAS log messages "numeric values have been converted to character" mean? What are the implications?
It implies that automatic conversion took place to make character functions possible.
54. Why is a STOP statement needed for the POINT= option on a SET statement?
Because POINT= reads only the specified observations, SAS cannot detect an end-of-file condition as it would if the file were being read sequentially.
55. How do you control the number of observations and/or variables read or written?
The FIRSTOBS and OBS options.
SAS Questions and Answers Pdf Download
Read the full article
0 notes
t-baba · 8 years ago
Photo
Pandas: The Swiss Army Knife for Your Data, Part 2
This is part two of a two-part tutorial about Pandas, the amazing Python data analytics toolkit. 
In part one, we covered the basic data types of Pandas: the series and the data frame. We imported and exported data, selected subsets of data, worked with metadata, and sorted the data. 
In this part, we'll continue our journey and deal with missing data, data manipulation, data merging, data grouping, time series, and plotting.
Dealing With Missing Values
One of the strongest points of pandas is its handling of missing values. It will not just crash and burn in the presence of missing data. When data is missing, pandas replaces it with numpy's np.nan (not a number), and it doesn't participate in any computation.
Let's reindex our data frame, adding more rows and columns, but without any new data. To make it interesting, we'll populate some values.
>>> df = pd.DataFrame(np.random.randn(5,2), index=index, columns=['a','b']) >>> new_index = df.index.append(pd.Index(['six'])) >>> new_columns = list(df.columns) + ['c'] >>> df = df.reindex(index=new_index, columns=new_columns) >>> df.loc['three'].c = 3 >>> df.loc['four'].c = 4 >>> df a b c one -0.042172 0.374922 NaN two -0.689523 1.411403 NaN three 0.332707 0.307561 3.0 four 0.426519 -0.425181 4.0 five -0.161095 -0.849932 NaN six NaN NaN NaN
Note that df.index.append() returns a new index and doesn't modify the existing index. Also, df.reindex() returns a new data frame that I assign back to the df variable.
At this point, our data frame has six rows. The last row is all NaNs, and all other rows except the third and the fourth have NaN in the "c" column. What can you do with missing data? Here are options:
Keep it (but it will not participate in computations).
Drop it (the result of the computation will not contain the missing data).
Replace it with a default value.
Keep the missing data --------------------- >>> df *= 2 >>> df a b c one -0.084345 0.749845 NaN two -1.379046 2.822806 NaN three 0.665414 0.615123 6.0 four 0.853037 -0.850362 8.0 five -0.322190 -1.699864 NaN six NaN NaN NaN Drop rows with missing data --------------------------- >>> df.dropna() a b c three 0.665414 0.615123 6.0 four 0.853037 -0.850362 8.0 Replace with default value -------------------------- >>> df.fillna(5) a b c one -0.084345 0.749845 5.0 two -1.379046 2.822806 5.0 three 0.665414 0.615123 6.0 four 0.853037 -0.850362 8.0 five -0.322190 -1.699864 5.0 six 5.000000 5.000000 5.0
If you just want to check if you have missing data in your data frame, use the isnull() method. This returns a boolean mask of your dataframe, which is True for missing values and False elsewhere.
>>> df.isnull() a b c one False False True two False False True three False False False four False False False five False False True six True True True
Manipulating Your Data
When you have a data frame, you often need to perform operations on the data. Let's start with a new data frame that has four rows and three columns of random integers between 1 and 9 (inclusive).
>>> df = pd.DataFrame(np.random.randint(1, 10, size=(4, 3)), columns=['a','b', 'c']) >>> df a b c 0 1 3 3 1 8 9 2 2 8 1 5 3 4 6 1
Now, you can start working on the data. Let's sum up all the columns and assign the result to the last row, and then sum all the rows (dimension 1) and assign to the last column:
>>> df.loc[3] = df.sum() >>> df a b c 0 1 3 3 1 8 9 2 2 8 1 5 3 21 19 11 >>> df.c = df.sum(1) >>> df a b c 0 1 3 7 1 8 9 19 2 8 1 14 3 21 19 51
You can also perform operations on the entire data frame. Here is an example of subtracting 3 from each and every cell:
>>> df -= 3 >>> df a b c 0 -2 0 4 1 5 6 16 2 5 -2 11 3 18 16 48
For total control, you can apply arbitrary functions:
>>> df.apply(lambda x: x ** 2 + 5 * x - 4) a b c 0 -10 -4 32 1 46 62 332 2 46 -10 172 3 410 332 2540
Merging Data
Another common scenario when working with data frames is combining and merging data frames (and series) together. Pandas, as usual, gives you different options. Let's create another data frame and explore the various options.
>>> df2 = df // 3 >>> df2 a b c 0 -1 0 1 1 1 2 5 2 1 -1 3 3 6 5 16
Concat
When using pd.concat, pandas simply concatenates all the rows of the provided parts in order. There is no alignment of indexes. See in the following example how duplicate index values are created:
>>> pd.concat([df, df2]) a b c 0 -2 0 4 1 5 6 16 2 5 -2 11 3 18 16 48 0 -1 0 1 1 1 2 5 2 1 -1 3 3 6 5 16
You can also concatenate columns by using the axis=1 argument:
>>> pd.concat([df[:2], df2], axis=1) a b c a b c 0 -2.0 0.0 4.0 -1 0 1 1 5.0 6.0 16.0 1 2 5 2 NaN NaN NaN 1 -1 3 3 NaN NaN NaN 6 5 16
Note that because the first data frame (I used only two rows) didn't have as many rows, the missing values were automatically populated with NaNs, which changed those column types from int to float.
It's possible to concatenate any number of data frames in one call.
Merge
The merge function behaves in a similar way to SQL join. It merges all the columns from rows that have similar keys. Note that it operates on two data frames only:
>>> df = pd.DataFrame(dict(key=['start', 'finish'],x=[4, 8])) >>> df key x 0 start 4 1 finish 8 >>> df2 = pd.DataFrame(dict(key=['start', 'finish'],y=[2, 18])) >>> df2 key y 0 start 2 1 finish 18 >>> pd.merge(df, df2, on='key') key x y 0 start 4 2 1 finish 8 18
Append
The data frame's append() method is a little shortcut. It functionally behaves like concat(), but saves some key strokes.
>>> df key x 0 start 4 1 finish 8 Appending one row using the append method() ------------------------------------------- >>> df.append(dict(key='middle', x=9), ignore_index=True) key x 0 start 4 1 finish 8 2 middle 9 Appending one row using the concat() ------------------------------------------- >>> pd.concat([df, pd.DataFrame(dict(key='middle', x=[9]))], ignore_index=True) key x 0 start 4 1 finish 8 2 middle 9
Grouping Your Data
Here is a data frame that contains the members and ages of two families: the Smiths and the Joneses. You can use the groupby() method to group data by last name and find information at the family level like the sum of ages and the mean age:
df = pd.DataFrame( dict(first='John Jim Jenny Jill Jack'.split(), last='Smith Jones Jones Smith Smith'.split(), age=[11, 13, 22, 44, 65])) >>> df.groupby('last').sum() age last Jones 35 Smith 120 >>> df.groupby('last').mean() age last Jones 17.5 Smith 40.0
Time Series
A lot of important data is time series data. Pandas has strong support for time series data starting with data ranges, going through localization and time conversion, and all the way to sophisticated frequency-based resampling.
The date_range() function can generate sequences of datetimes. Here is an example of generating a six-week period starting on 1 January 2017 using the UTC time zone.
>>> weeks = pd.date_range(start='1/1/2017', periods=6, freq='W', tz='UTC')
>>> weeks
DatetimeIndex(['2017-01-01', '2017-01-08', '2017-01-15', '2017-01-22',
               '2017-01-29', '2017-02-05'],
              dtype='datetime64[ns, UTC]', freq='W-SUN')
Adding a timestamp to your data frames, either as data column or as the index, is great for organizing and grouping your data by time. It also allows resampling. Here is an example of resampling every minute data as five-minute aggregations.
>>> minutes = pd.date_range(start='1/1/2017', periods=10, freq='1Min', tz='UTC')
>>> ts = pd.Series(np.random.randn(len(minutes)), minutes)
>>> ts
2017-01-01 00:00:00+00:00    1.866913
2017-01-01 00:01:00+00:00    2.157201
2017-01-01 00:02:00+00:00   -0.439932
2017-01-01 00:03:00+00:00    0.777944
2017-01-01 00:04:00+00:00    0.755624
2017-01-01 00:05:00+00:00   -2.150276
2017-01-01 00:06:00+00:00    3.352880
2017-01-01 00:07:00+00:00   -1.657432
2017-01-01 00:08:00+00:00   -0.144666
2017-01-01 00:09:00+00:00   -0.667059
Freq: T, dtype: float64
>>> ts.resample('5Min').mean()
2017-01-01 00:00:00+00:00    1.023550
2017-01-01 00:05:00+00:00   -0.253311
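The same five-minute bucketing can be expressed in SQL Server by flooring the timestamp to a multiple of five minutes. This is only a sketch, with an illustrative readings table and ts / reading_value columns, mirroring the resample('5Min').mean() call above:
SELECT
    DATEADD(MINUTE, (DATEDIFF(MINUTE, 0, ts) / 5) * 5, 0) AS bucket_start,
    AVG(reading_value)                                    AS avg_value
FROM readings
GROUP BY DATEADD(MINUTE, (DATEDIFF(MINUTE, 0, ts) / 5) * 5, 0)
ORDER BY bucket_start;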
Plotting
Pandas supports plotting with matplotlib. Make sure it's installed: pip install matplotlib. To generate a plot, you can call the plot() of a series or a data frame. There are many options to control the plot, but the defaults work for simple visualization purposes. Here is how to generate a line graph and save it to a PDF file.
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2017', periods=1000))
ts = ts.cumsum()
ax = ts.plot()
fig = ax.get_figure()
fig.savefig('plot.pdf')
Note that on macOS, Python must be installed as a framework for plotting with Pandas.
Conclusion
Pandas is a very broad data analytics framework. It has a simple object model with the concepts of series and data frame and a wealth of built-in functionality. You can compose and mix pandas functions and your own algorithms. 
Data importing and exporting in pandas are very extensive too and ensure that you can integrate it easily into existing systems. If you're doing any data processing in Python, pandas belongs in your toolbox.
by Gigi Sayfan via Envato Tuts+ Code http://ift.tt/2gaPZ24
2 notes · View notes
alanajacksontx · 6 years ago
Text
Using Python to recover SEO site traffic (Part one)
Helping a client recover from a bad redesign or site migration is probably one of the most critical jobs you can face as an SEO.
The traditional approach of conducting a full forensic SEO audit works well most of the time, but what if there was a way to speed things up? You could potentially save your client a lot of money in opportunity cost.
Last November, I spoke at TechSEO Boost and presented a technique my team and I regularly use to analyze traffic drops. It allows us to pinpoint this painful problem quickly and with surgical precision. As far as I know, there are no tools that currently implement this technique. I coded this solution using Python.
This is the first part of a three-part series. In part two, we will manually group the pages using regular expressions and in part three we will group them automatically using machine learning techniques. Let’s walk over part one and have some fun!
Winners vs losers
Last June we signed up a client that moved from Ecommerce V3 to Shopify and the SEO traffic took a big hit. The owner set up 301 redirects between the old and new sites but made a number of unwise changes like merging a large number of categories and rewriting titles during the move.
When traffic drops, some parts of the site underperform while others don’t. I like to isolate them in order to 1) focus all efforts on the underperforming parts, and 2) learn from the parts that are doing well.
I call this analysis the “Winners vs Losers” analysis. Here, winners are the parts that do well, and losers the ones that do badly.
A visualization of the analysis looks like the chart above. I was able to narrow down the issue to the category pages (Collection pages) and found that the main issue was caused by the site owner merging and eliminating too many categories during the move.
Let’s walk over the steps to put this kind of analysis together in Python.
You can reference my carefully documented Google Colab notebook here.
Getting the data
We want to programmatically compare two separate time frames in Google Analytics (before and after the traffic drop), and we’re going to use the Google Analytics API to do it.
Google Analytics Query Explorer provides the simplest approach to do this in Python.
Head on over to the Google Analytics Query Explorer
Click on the button at the top that says “Click here to Authorize” and follow the steps provided.
Use the dropdown menu to select the website you want to get data from.
Fill in the “metrics” parameter with “ga:newUsers” in order to track new visits.
Complete the “dimensions” parameter with “ga:landingPagePath” in order to get the page URLs.
Fill in the “segment” parameter with “gaid::-5” in order to track organic search visits.
Hit “Run Query” and let it run
Scroll down to the bottom of the page and look for the text box that says “API Query URI.”
Check the box underneath it that says “Include current access_token in the Query URI (will expire in ~60 minutes).”
At the end of the URL in the text box you should now see access_token=string-of-text-here. You will use this string of text in the code snippet below as  the variable called token (make sure to paste it inside the quotes)
Now, scroll back up to where we built the query, and look for the parameter that was filled in for you called “ids.” You will use this in the code snippet below as the variable called “gaid.” Again, it should go inside the quotes.
Run the cell once you’ve filled in the gaid and token variables to instantiate them, and we’re good to go!
First, let’s define placeholder variables to pass to the API
metrics = ",".join(["ga:users", "ga:newUsers"])
dimensions = ",".join(["ga:landingPagePath", "ga:date"])
segment = "gaid::-5"
# Required, please fill in with your own GA information example: ga:23322342
gaid = "ga:23322342"
# Example: string-of-text-here from step 8.2
token = ""
# Example https://www.example.com or http://example.org
base_site_url = ""
# You can change the start and end dates as you like
start = "2017-06-01"
end = "2018-06-30"
The first function combines the placeholder variables we filled in above with an API URL to get Google Analytics data. We make additional API requests and merge them in case the results exceed the 10,000 limit.
import requests  # used for the API calls below

def GAData(gaid, start, end, metrics, dimensions,
           segment, token, max_results=10000):
  """Creates a generator that yields GA API data
     in chunks of size `max_results`"""
  # build uri w/ params
  api_uri = "https://www.googleapis.com/analytics/v3/data/ga?ids={gaid}&"\
             "start-date={start}&end-date={end}&metrics={metrics}&"\
             "dimensions={dimensions}&segment={segment}&access_token={token}&"\
             "max-results={max_results}"
  # insert uri params
  api_uri = api_uri.format(
      gaid=gaid,
      start=start,
      end=end,
      metrics=metrics,
      dimensions=dimensions,
      segment=segment,
      token=token,
      max_results=max_results
  )
  # Using yield to make a generator in an
  # attempt to be memory efficient, since data is downloaded in chunks
  r = requests.get(api_uri)
  data = r.json()
  yield data
  if data.get("nextLink", None):
    while data.get("nextLink"):
      new_uri = data.get("nextLink")
      new_uri += "&access_token={token}".format(token=token)
      r = requests.get(new_uri)
      data = r.json()
      yield data
In the second function, we load the Google Analytics Query Explorer API response into a pandas DataFrame to simplify our analysis.
import pandas as pd

def to_df(gadata):
  """Takes in a generator from GAData()
     creates a dataframe from the rows"""
  df = None
  for data in gadata:
    if df is None:
      df = pd.DataFrame(
          data['rows'],
          columns=[x['name'] for x in data['columnHeaders']]
      )
    else:
      newdf = pd.DataFrame(
          data['rows'],
          columns=[x['name'] for x in data['columnHeaders']]
      )
      df = df.append(newdf)
    print("Gathered {} rows".format(len(df)))
  return df
Now, we can call the functions to load the Google Analytics data.
data = GAData(gaid=gaid, metrics=metrics, start=start,
              end=end, dimensions=dimensions, segment=segment,
              token=token)
data = to_df(data)
Analyzing the data
Let’s start by just getting a look at the data. We’ll use the .head() method of DataFrames to take a look at the first few rows. Think of this as glancing at only the top few rows of an Excel spreadsheet.
data.head(5)
This displays the first five rows of the data frame.
Most of the data is not in the right format for proper analysis, so let’s perform some data transformations.
First, let’s convert the date to a datetime object and the metrics to numeric values.
data['ga:date'] = pd.to_datetime(data['ga:date'])
data['ga:users'] = pd.to_numeric(data['ga:users'])
data['ga:newUsers'] = pd.to_numeric(data['ga:newUsers'])
Next, we will need the landing page URL, which are relative and include URL parameters in two additional formats: 1) as absolute urls, and 2) as relative paths (without the URL parameters).
from urllib.parse import urlparse, urljoin
data['path'] = data['ga:landingPagePath'].apply(lambda x: urlparse(x).path)
data['url'] = urljoin(base_site_url, data['path'])
Now the fun part begins.
The goal of our analysis is to see which pages lost traffic after a particular date–compared to the period before that date–and which gained traffic after that date.
The example date chosen below corresponds to the exact midpoint of our start and end variables used above to gather the data, so that the data both before and after the date is similarly sized.
We begin the analysis by grouping each URL together by their path and adding up the newUsers for each URL. We do this with the built-in pandas method: .groupby(), which takes a column name as an input and groups together each unique value in that column.
The .sum() method then takes the sum of every other column in the data frame within each group.
For more information on these methods please see the Pandas documentation for groupby.
For those who might be familiar with SQL, this is analogous to a GROUP BY clause with a SUM in the select clause
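Roughly speaking, the groupby().sum() pattern just described corresponds to a query like this (the table and column names simply mirror the data frame rather than a real database):
SELECT
    landing_page_path,
    SUM(new_users) AS new_users
FROM ga_data
GROUP BY landing_page_path;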
# Change this depending on your needs
MIDPOINT_DATE = "2017-12-15"
before = data[data['ga:date'] < pd.to_datetime(MIDPOINT_DATE)]
after = data[data['ga:date'] >= pd.to_datetime(MIDPOINT_DATE)]
# Traffic totals before Shopify switch
totals_before = before[["ga:landingPagePath", "ga:newUsers"]]\
                .groupby("ga:landingPagePath").sum()
totals_before = totals_before.reset_index()\
                .sort_values("ga:newUsers", ascending=False)
# Traffic totals after Shopify switch
totals_after = after[["ga:landingPagePath", "ga:newUsers"]]\
               .groupby("ga:landingPagePath").sum()
totals_after = totals_after.reset_index()\
               .sort_values("ga:newUsers", ascending=False)
You can check the totals before and after with this code and double check with the Google Analytics numbers.
print("Traffic Totals Before: ")
print("Row count: ", len(totals_before))
print("Traffic Totals After: ")
print("Row count: ", len(totals_after))
Next up we merge the two data frames, so that we have a single column corresponding to the URL, and two columns corresponding to the totals before and after the date.
We have different options when merging as illustrated above. Here, we use an “outer” merge, because even if a URL didn’t show up in the “before” period, we still want it to be a part of this merged dataframe. We’ll fill in the blanks with zeros after the merge.
# Comparing pages from before and after the switch
change = totals_after.merge(totals_before,
                            left_on="ga:landingPagePath",
                            right_on="ga:landingPagePath",
                            suffixes=["_after", "_before"],
                            how="outer")
change.fillna(0, inplace=True)
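For readers keeping the SQL analogy going, this "outer" merge plus fillna(0) behaves much like a FULL OUTER JOIN with COALESCE (the table names here mirror the data frames, not a real database):
SELECT
    COALESCE(a.landing_page_path, b.landing_page_path) AS landing_page_path,
    COALESCE(a.new_users, 0) AS new_users_after,
    COALESCE(b.new_users, 0) AS new_users_before
FROM totals_after AS a
FULL OUTER JOIN totals_before AS b
    ON a.landing_page_path = b.landing_page_path;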
Difference and percentage change
Pandas dataframes make simple calculations on whole columns easy. We can take the difference of two columns and divide two columns and it will perform that operation on every row for us. We will take the difference of the two totals columns, and divide by the “before” column to get the percent change before and after out midpoint date.
Using this percent_change column we can then filter our dataframe to get the winners, the losers and those URLs with no change.
change['difference'] = change['ga:newUsers_after'] - change['ga:newUsers_before']
change['percent_change'] = change['difference'] / change['ga:newUsers_before']
winners = change[change['percent_change'] > 0]
losers = change[change['percent_change'] < 0]
no_change = change[change['percent_change'] == 0]
Sanity check
Finally, we do a quick sanity check to make sure that all the traffic from the original data frame is still accounted for after all of our analysis. To do this, we simply take the sum of all traffic for both the original data frame and the two columns of our change dataframe.
# Checking that the total traffic adds up
data['ga:newUsers'].sum() == change[['ga:newUsers_after', 'ga:newUsers_before']].sum().sum()
It should be True.
Results
Sorting by the difference in our losers data frame, and taking the .head(10), we can see the top 10 losers in our analysis. In other words, these pages lost the most total traffic between the two periods before and after the midpoint date.
losers.sort_values("difference").head(10)
You can do the same to review the winners and try to learn from them.
winners.sort_values("difference", ascending=False).head(10)
You can export the losing pages to a CSV or Excel using this.
losers.to_csv("./losing-pages.csv")
This seems like a lot of work to analyze just one site–and it is!
The magic happens when you reuse this code on new clients and simply need to replace the placeholder variables at the top of the script.
In part two, we will make the output more useful by grouping the losing (and winning) pages by their types to get the chart I included above.
The post Using Python to recover SEO site traffic (Part one) appeared first on Search Engine Watch.
from IM Tips And Tricks https://searchenginewatch.com/2019/02/06/using-python-to-recover-seo-site-traffic-part-one/ from Rising Phoenix SEO https://risingphxseo.tumblr.com/post/182759232745
0 notes
      start=start,
      end=end,
      metrics=metrics,
      dimensions=dimensions,
      segment=segment,
      token=token,
      max_results=max_results
  )
  # Using yield to make a generator in an
  # attempt to be memory efficient, since data is downloaded in chunks
  r = requests.get(api_uri)
  data = r.json()
  yield data
  if data.get(“nextLink”, None):
    while data.get(“nextLink”):
      new_uri = data.get(“nextLink”)
      new_uri += “&access_token={token}”.format(token=token)
      r = requests.get(new_uri)
      data = r.json()
      yield data
In the second function, we load the Google Analytics Query Explorer API response into a pandas DataFrame to simplify our analysis.
import pandas as pd
def to_df(gadata):
  “””Takes in a generator from GAData() 
     creates a dataframe from the rows”””
  df = None
  for data in gadata:
    if df is None:
      df = pd.DataFrame(
          data[‘rows’], 
          columns=[x[‘name’] for x in data[‘columnHeaders’]]
      )
    else:
      newdf = pd.DataFrame(
          data[‘rows’], 
          columns=[x[‘name’] for x in data[‘columnHeaders’]]
      )
      df = df.append(newdf)
    print(“Gathered {} rows”.format(len(df)))
  return df
Now, we can call the functions to load the Google Analytics data.
data = GAData(gaid=gaid, metrics=metrics, start=start, 
                end=end, dimensions=dimensions, segment=segment, 
                token=token)
data = to_df(data)
Analyzing the data
Let’s start by just getting a look at the data. We’ll use the .head() method of DataFrames to take a look at the first few rows. Think of this as glancing at only the top few rows of an Excel spreadsheet.
data.head(5)
This displays the first five rows of the data frame.
Most of the data is not in the right format for proper analysis, so let’s perform some data transformations.
First, let’s convert the date to a datetime object and the metrics to numeric values.
data[‘ga:date’] = pd.to_datetime(data[‘ga:date’])
data[‘ga:users’] = pd.to_numeric(data[‘ga:users’])
data[‘ga:newUsers’] = pd.to_numeric(data[‘ga:newUsers’])
Next, we will need the landing page URL, which are relative and include URL parameters in two additional formats: 1) as absolute urls, and 2) as relative paths (without the URL parameters).
from urllib.parse import urlparse, urljoin
data[‘path’] = data[‘ga:landingPagePath’].apply(lambda x: urlparse(x).path)
data[‘url’] = urljoin(base_site_url, data[‘path’])
Now the fun part begins.
The goal of our analysis is to see which pages lost traffic after a particular date–compared to the period before that date–and which gained traffic after that date.
The example date chosen below corresponds to the exact midpoint of our start and end variables used above to gather the data, so that the data both before and after the date is similarly sized.
We begin the analysis by grouping each URL together by their path and adding up the newUsers for each URL. We do this with the built-in pandas method: .groupby(), which takes a column name as an input and groups together each unique value in that column.
The .sum() method then takes the sum of every other column in the data frame within each group.
For more information on these methods please see the Pandas documentation for groupby.
For those who might be familiar with SQL, this is analogous to a GROUP BY clause with a SUM in the select clause
# Change this depending on your needs
MIDPOINT_DATE = “2017-12-15”
before = data[data[‘ga:date’] < pd.to_datetime(MIDPOINT_DATE)]
after = data[data[‘ga:date’] >= pd.to_datetime(MIDPOINT_DATE)]
# Traffic totals before Shopify switch
totals_before = before[[“ga:landingPagePath”, “ga:newUsers”]]\
                .groupby(“ga:landingPagePath”).sum()
totals_before = totals_before.reset_index()\
                .sort_values(“ga:newUsers”, ascending=False)
# Traffic totals after Shopify switch
totals_after = after[[“ga:landingPagePath”, “ga:newUsers”]]\
               .groupby(“ga:landingPagePath”).sum()
totals_after = totals_after.reset_index()\
               .sort_values(“ga:newUsers”, ascending=False)
You can check the totals before and after with this code and double check with the Google Analytics numbers.
print(“Traffic Totals Before: “)
print(“Row count: “, len(totals_before))
print(“Traffic Totals After: “)
print(“Row count: “, len(totals_after))
Next up we merge the two data frames, so that we have a single column corresponding to the URL, and two columns corresponding to the totals before and after the date.
We have different options when merging as illustrated above. Here, we use an “outer” merge, because even if a URL didn’t show up in the “before” period, we still want it to be a part of this merged dataframe. We’ll fill in the blanks with zeros after the merge.
# Comparing pages from before and after the switch
change = totals_after.merge(totals_before, 
                            left_on=”ga:landingPagePath”, 
                            right_on=”ga:landingPagePath”, 
                            suffixes=[“_after”, “_before”], 
                            how=”outer”)
change.fillna(0, inplace=True)
Difference and percentage change
Pandas dataframes make simple calculations on whole columns easy. We can take the difference of two columns and divide two columns and it will perform that operation on every row for us. We will take the difference of the two totals columns, and divide by the “before” column to get the percent change before and after out midpoint date.
Using this percent_change column we can then filter our dataframe to get the winners, the losers and those URLs with no change.
change[‘difference’] = change[‘ga:newUsers_after’] – change[‘ga:newUsers_before’]
change[‘percent_change’] = change[‘difference’] / change[‘ga:newUsers_before’]
winners = change[change[‘percent_change’] > 0]
losers = change[change[‘percent_change’] < 0]
no_change = change[change[‘percent_change’] == 0]
Sanity check
Finally, we do a quick sanity check to make sure that all the traffic from the original data frame is still accounted for after all of our analysis. To do this, we simply take the sum of all traffic for both the original data frame and the two columns of our change dataframe.
# Checking that the total traffic adds up
data[‘ga:newUsers’].sum() == change[[‘ga:newUsers_after’, ‘ga:newUsers_before’]].sum().sum()
It should be True.
Results
Sorting by the difference in our losers data frame, and taking the .head(10), we can see the top 10 losers in our analysis. In other words, these pages lost the most total traffic between the two periods before and after the midpoint date.
losers.sort_values(“difference”).head(10)
You can do the same to review the winners and try to learn from them.
winners.sort_values(“difference”, ascending=False).head(10)
You can export the losing pages to a CSV or Excel using this.
losers.to_csv(“./losing-pages.csv”)
This seems like a lot of work to analyze just one site–and it is!
The magic happens when you reuse this code on new clients and simply need to replace the placeholder variables at the top of the script.
In part two, we will make the output more useful by grouping the losing (and winning) pages by their types to get the chart I included above.
The post Using Python to recover SEO site traffic (Part one) appeared first on Search Engine Watch.
source https://searchenginewatch.com/2019/02/06/using-python-to-recover-seo-site-traffic-part-one/ from Rising Phoenix SEO http://risingphoenixseo.blogspot.com/2019/02/using-python-to-recover-seo-site.html
0 notes
evaaguilaus · 6 years ago
Text
Using Python to recover SEO site traffic (Part one)
Helping a client recover from a bad redesign or site migration is probably one of the most critical jobs you can face as an SEO.
The traditional approach of conducting a full forensic SEO audit works well most of the time, but what if there was a way to speed things up? You could potentially save your client a lot of money in opportunity cost.
Last November, I spoke at TechSEO Boost and presented a technique my team and I regularly use to analyze traffic drops. It allows us to pinpoint this painful problem quickly and with surgical precision. As far as I know, there are no tools that currently implement this technique. I coded this solution using Python.
This is the first part of a three-part series. In part two, we will manually group the pages using regular expressions and in part three we will group them automatically using machine learning techniques. Let’s walk over part one and have some fun!
Winners vs losers
Last June we signed up a client that moved from Ecommerce V3 to Shopify and the SEO traffic took a big hit. The owner set up 301 redirects between the old and new sites but made a number of unwise changes like merging a large number of categories and rewriting titles during the move.
When traffic drops, some parts of the site underperform while others don’t. I like to isolate them in order to 1) focus all efforts on the underperforming parts, and 2) learn from the parts that are doing well.
I call this analysis the “Winners vs Losers” analysis. Here, winners are the parts that do well, and losers the ones that do badly.
A visualization of the analysis looks like the chart above. I was able to narrow down the issue to the category pages (Collection pages) and found that the main issue was caused by the site owner merging and eliminating too many categories during the move.
Let’s walk over the steps to put this kind of analysis together in Python.
You can reference my carefully documented Google Colab notebook here.
Getting the data
We want to programmatically compare two separate time frames in Google Analytics (before and after the traffic drop), and we’re going to use the Google Analytics API to do it.
Google Analytics Query Explorer provides the simplest approach to do this in Python.
Head on over to the Google Analytics Query Explorer
Click on the button at the top that says “Click here to Authorize” and follow the steps provided.
Use the dropdown menu to select the website you want to get data from.
Fill in the “metrics” parameter with “ga:newUsers” in order to track new visits.
Complete the “dimensions” parameter with “ga:landingPagePath” in order to get the page URLs.
Fill in the “segment” parameter with “gaid::-5” in order to track organic search visits.
Hit “Run Query” and let it run
Scroll down to the bottom of the page and look for the text box that says “API Query URI.”
Check the box underneath it that says “Include current access_token in the Query URI (will expire in ~60 minutes).”
At the end of the URL in the text box you should now see access_token=string-of-text-here. You will use this string of text in the code snippet below as  the variable called token (make sure to paste it inside the quotes)
Now, scroll back up to where we built the query, and look for the parameter that was filled in for you called “ids.” You will use this in the code snippet below as the variable called “gaid.” Again, it should go inside the quotes.
Run the cell once you’ve filled in the gaid and token variables to instantiate them, and we’re good to go!
First, let’s define placeholder variables to pass to the API
metrics = “,”.join([“ga:users”,”ga:newUsers”])
dimensions = “,”.join([“ga:landingPagePath”, “ga:date”])
segment = “gaid::-5”
# Required, please fill in with your own GA information example: ga:23322342
gaid = “ga:23322342”
# Example: string-of-text-here from step 8.2
token = “”
# Example https://www.example.com or http://example.org
base_site_url = “”
# You can change the start and end dates as you like
start = “2017-06-01”
end = “2018-06-30”
The first function combines the placeholder variables we filled in above with an API URL to get Google Analytics data. We make additional API requests and merge them in case the results exceed the 10,000 limit.
def GAData(gaid, start, end, metrics, dimensions, 
           segment, token, max_results=10000):
  “””Creates a generator that yields GA API data 
     in chunks of size `max_results`”””
  #build uri w/ params
  api_uri = “https://www.googleapis.com/analytics/v3/data/ga?ids={gaid}&”\
             “start-date={start}&end-date={end}&metrics={metrics}&”\
             “dimensions={dimensions}&segment={segment}&access_token={token}&”\
             “max-results={max_results}”
  # insert uri params
  api_uri = api_uri.format(
      gaid=gaid,
      start=start,
      end=end,
      metrics=metrics,
      dimensions=dimensions,
      segment=segment,
      token=token,
      max_results=max_results
  )
  # Using yield to make a generator in an
  # attempt to be memory efficient, since data is downloaded in chunks
  r = requests.get(api_uri)
  data = r.json()
  yield data
  if data.get(“nextLink”, None):
    while data.get(“nextLink”):
      new_uri = data.get(“nextLink”)
      new_uri += “&access_token={token}”.format(token=token)
      r = requests.get(new_uri)
      data = r.json()
      yield data
In the second function, we load the Google Analytics Query Explorer API response into a pandas DataFrame to simplify our analysis.
import pandas as pd
def to_df(gadata):
  “””Takes in a generator from GAData() 
     creates a dataframe from the rows”””
  df = None
  for data in gadata:
    if df is None:
      df = pd.DataFrame(
          data[‘rows’], 
          columns=[x[‘name’] for x in data[‘columnHeaders’]]
      )
    else:
      newdf = pd.DataFrame(
          data[‘rows’], 
          columns=[x[‘name’] for x in data[‘columnHeaders’]]
      )
      df = df.append(newdf)
    print(“Gathered {} rows”.format(len(df)))
  return df
Now, we can call the functions to load the Google Analytics data.
data = GAData(gaid=gaid, metrics=metrics, start=start, 
                end=end, dimensions=dimensions, segment=segment, 
                token=token)
data = to_df(data)
Analyzing the data
Let’s start by just getting a look at the data. We’ll use the .head() method of DataFrames to take a look at the first few rows. Think of this as glancing at only the top few rows of an Excel spreadsheet.
data.head(5)
This displays the first five rows of the data frame.
Most of the data is not in the right format for proper analysis, so let’s perform some data transformations.
First, let’s convert the date to a datetime object and the metrics to numeric values.
data[‘ga:date’] = pd.to_datetime(data[‘ga:date’])
data[‘ga:users’] = pd.to_numeric(data[‘ga:users’])
data[‘ga:newUsers’] = pd.to_numeric(data[‘ga:newUsers’])
Next, we will need the landing page URL, which are relative and include URL parameters in two additional formats: 1) as absolute urls, and 2) as relative paths (without the URL parameters).
from urllib.parse import urlparse, urljoin
data[‘path’] = data[‘ga:landingPagePath’].apply(lambda x: urlparse(x).path)
data[‘url’] = urljoin(base_site_url, data[‘path’])
Now the fun part begins.
The goal of our analysis is to see which pages lost traffic after a particular date–compared to the period before that date–and which gained traffic after that date.
The example date chosen below corresponds to the exact midpoint of our start and end variables used above to gather the data, so that the data both before and after the date is similarly sized.
We begin the analysis by grouping each URL together by their path and adding up the newUsers for each URL. We do this with the built-in pandas method: .groupby(), which takes a column name as an input and groups together each unique value in that column.
The .sum() method then takes the sum of every other column in the data frame within each group.
For more information on these methods please see the Pandas documentation for groupby.
For those who might be familiar with SQL, this is analogous to a GROUP BY clause with a SUM in the select clause
# Change this depending on your needs
MIDPOINT_DATE = “2017-12-15”
before = data[data[‘ga:date’] < pd.to_datetime(MIDPOINT_DATE)]
after = data[data[‘ga:date’] >= pd.to_datetime(MIDPOINT_DATE)]
# Traffic totals before Shopify switch
totals_before = before[[“ga:landingPagePath”, “ga:newUsers”]]\
                .groupby(“ga:landingPagePath”).sum()
totals_before = totals_before.reset_index()\
                .sort_values(“ga:newUsers”, ascending=False)
# Traffic totals after Shopify switch
totals_after = after[[“ga:landingPagePath”, “ga:newUsers”]]\
               .groupby(“ga:landingPagePath”).sum()
totals_after = totals_after.reset_index()\
               .sort_values(“ga:newUsers”, ascending=False)
You can check the totals before and after with this code and double check with the Google Analytics numbers.
print(“Traffic Totals Before: “)
print(“Row count: “, len(totals_before))
print(“Traffic Totals After: “)
print(“Row count: “, len(totals_after))
Next up we merge the two data frames, so that we have a single column corresponding to the URL, and two columns corresponding to the totals before and after the date.
We have different options when merging as illustrated above. Here, we use an “outer” merge, because even if a URL didn’t show up in the “before” period, we still want it to be a part of this merged dataframe. We’ll fill in the blanks with zeros after the merge.
# Comparing pages from before and after the switch
change = totals_after.merge(totals_before, 
                            left_on=”ga:landingPagePath”, 
                            right_on=”ga:landingPagePath”, 
                            suffixes=[“_after”, “_before”], 
                            how=”outer”)
change.fillna(0, inplace=True)
Difference and percentage change
Pandas dataframes make simple calculations on whole columns easy. We can take the difference of two columns and divide two columns and it will perform that operation on every row for us. We will take the difference of the two totals columns, and divide by the “before” column to get the percent change before and after out midpoint date.
Using this percent_change column we can then filter our dataframe to get the winners, the losers and those URLs with no change.
change[‘difference’] = change[‘ga:newUsers_after’] – change[‘ga:newUsers_before’]
change[‘percent_change’] = change[‘difference’] / change[‘ga:newUsers_before’]
winners = change[change[‘percent_change’] > 0]
losers = change[change[‘percent_change’] < 0]
no_change = change[change[‘percent_change’] == 0]
Sanity check
Finally, we do a quick sanity check to make sure that all the traffic from the original data frame is still accounted for after all of our analysis. To do this, we simply take the sum of all traffic for both the original data frame and the two columns of our change dataframe.
# Checking that the total traffic adds up
data[‘ga:newUsers’].sum() == change[[‘ga:newUsers_after’, ‘ga:newUsers_before’]].sum().sum()
It should be True.
Results
Sorting by the difference in our losers data frame, and taking the .head(10), we can see the top 10 losers in our analysis. In other words, these pages lost the most total traffic between the two periods before and after the midpoint date.
losers.sort_values(“difference”).head(10)
You can do the same to review the winners and try to learn from them.
winners.sort_values(“difference”, ascending=False).head(10)
You can export the losing pages to a CSV or Excel using this.
losers.to_csv(“./losing-pages.csv”)
This seems like a lot of work to analyze just one site–and it is!
The magic happens when you reuse this code on new clients and simply need to replace the placeholder variables at the top of the script.
In part two, we will make the output more useful by grouping the losing (and winning) pages by their types to get the chart I included above.
The post Using Python to recover SEO site traffic (Part one) appeared first on Search Engine Watch.
from Digtal Marketing News https://searchenginewatch.com/2019/02/06/using-python-to-recover-seo-site-traffic-part-one/
0 notes
srasamua · 6 years ago
Text
Using Python to recover SEO site traffic (Part one)
Helping a client recover from a bad redesign or site migration is probably one of the most critical jobs you can face as an SEO.
The traditional approach of conducting a full forensic SEO audit works well most of the time, but what if there was a way to speed things up? You could potentially save your client a lot of money in opportunity cost.
Last November, I spoke at TechSEO Boost and presented a technique my team and I regularly use to analyze traffic drops. It allows us to pinpoint this painful problem quickly and with surgical precision. As far as I know, there are no tools that currently implement this technique. I coded this solution using Python.
This is the first part of a three-part series. In part two, we will manually group the pages using regular expressions and in part three we will group them automatically using machine learning techniques. Let’s walk over part one and have some fun!
Winners vs losers
Last June we signed up a client that moved from Ecommerce V3 to Shopify and the SEO traffic took a big hit. The owner set up 301 redirects between the old and new sites but made a number of unwise changes like merging a large number of categories and rewriting titles during the move.
When traffic drops, some parts of the site underperform while others don’t. I like to isolate them in order to 1) focus all efforts on the underperforming parts, and 2) learn from the parts that are doing well.
I call this analysis the “Winners vs Losers” analysis. Here, winners are the parts that do well, and losers the ones that do badly.
A visualization of the analysis looks like the chart above. I was able to narrow down the issue to the category pages (Collection pages) and found that the main issue was caused by the site owner merging and eliminating too many categories during the move.
Let’s walk over the steps to put this kind of analysis together in Python.
You can reference my carefully documented Google Colab notebook here.
Getting the data
We want to programmatically compare two separate time frames in Google Analytics (before and after the traffic drop), and we’re going to use the Google Analytics API to do it.
Google Analytics Query Explorer provides the simplest approach to do this in Python.
Head on over to the Google Analytics Query Explorer
Click on the button at the top that says “Click here to Authorize” and follow the steps provided.
Use the dropdown menu to select the website you want to get data from.
Fill in the “metrics” parameter with “ga:newUsers” in order to track new visits.
Complete the “dimensions” parameter with “ga:landingPagePath” in order to get the page URLs.
Fill in the “segment” parameter with “gaid::-5” in order to track organic search visits.
Hit “Run Query” and let it run
Scroll down to the bottom of the page and look for the text box that says “API Query URI.”
Check the box underneath it that says “Include current access_token in the Query URI (will expire in ~60 minutes).”
At the end of the URL in the text box you should now see access_token=string-of-text-here. You will use this string of text in the code snippet below as  the variable called token (make sure to paste it inside the quotes)
Now, scroll back up to where we built the query, and look for the parameter that was filled in for you called “ids.” You will use this in the code snippet below as the variable called “gaid.” Again, it should go inside the quotes.
Run the cell once you’ve filled in the gaid and token variables to instantiate them, and we’re good to go!
First, let’s define placeholder variables to pass to the API
metrics = “,”.join([“ga:users”,”ga:newUsers”])
dimensions = “,”.join([“ga:landingPagePath”, “ga:date”])
segment = “gaid::-5”
# Required, please fill in with your own GA information example: ga:23322342
gaid = “ga:23322342”
# Example: string-of-text-here from step 8.2
token = “”
# Example https://www.example.com or http://example.org
base_site_url = “”
# You can change the start and end dates as you like
start = “2017-06-01”
end = “2018-06-30”
The first function combines the placeholder variables we filled in above with an API URL to get Google Analytics data. We make additional API requests and merge them in case the results exceed the 10,000 limit.
def GAData(gaid, start, end, metrics, dimensions, 
           segment, token, max_results=10000):
  “””Creates a generator that yields GA API data 
     in chunks of size `max_results`”””
  #build uri w/ params
  api_uri = “https://www.googleapis.com/analytics/v3/data/ga?ids={gaid}&”\
             “start-date={start}&end-date={end}&metrics={metrics}&”\
             “dimensions={dimensions}&segment={segment}&access_token={token}&”\
             “max-results={max_results}”
  # insert uri params
  api_uri = api_uri.format(
      gaid=gaid,
      start=start,
      end=end,
      metrics=metrics,
      dimensions=dimensions,
      segment=segment,
      token=token,
      max_results=max_results
  )
  # Using yield to make a generator in an
  # attempt to be memory efficient, since data is downloaded in chunks
  r = requests.get(api_uri)
  data = r.json()
  yield data
  if data.get(“nextLink”, None):
    while data.get(“nextLink”):
      new_uri = data.get(“nextLink”)
      new_uri += “&access_token={token}”.format(token=token)
      r = requests.get(new_uri)
      data = r.json()
      yield data
In the second function, we load the Google Analytics Query Explorer API response into a pandas DataFrame to simplify our analysis.
import pandas as pd
def to_df(gadata):
  “””Takes in a generator from GAData() 
     creates a dataframe from the rows”””
  df = None
  for data in gadata:
    if df is None:
      df = pd.DataFrame(
          data[‘rows’], 
          columns=[x[‘name’] for x in data[‘columnHeaders’]]
      )
    else:
      newdf = pd.DataFrame(
          data[‘rows’], 
          columns=[x[‘name’] for x in data[‘columnHeaders’]]
      )
      df = df.append(newdf)
    print(“Gathered {} rows”.format(len(df)))
  return df
Now, we can call the functions to load the Google Analytics data.
data = GAData(gaid=gaid, metrics=metrics, start=start, 
                end=end, dimensions=dimensions, segment=segment, 
                token=token)
data = to_df(data)
Analyzing the data
Let’s start by just getting a look at the data. We’ll use the .head() method of DataFrames to take a look at the first few rows. Think of this as glancing at only the top few rows of an Excel spreadsheet.
data.head(5)
This displays the first five rows of the data frame.
Most of the data is not in the right format for proper analysis, so let’s perform some data transformations.
First, let’s convert the date to a datetime object and the metrics to numeric values.
data[‘ga:date’] = pd.to_datetime(data[‘ga:date’])
data[‘ga:users’] = pd.to_numeric(data[‘ga:users’])
data[‘ga:newUsers’] = pd.to_numeric(data[‘ga:newUsers’])
Next, we will need the landing page URL, which are relative and include URL parameters in two additional formats: 1) as absolute urls, and 2) as relative paths (without the URL parameters).
from urllib.parse import urlparse, urljoin
data[‘path’] = data[‘ga:landingPagePath’].apply(lambda x: urlparse(x).path)
data[‘url’] = urljoin(base_site_url, data[‘path’])
Now the fun part begins.
The goal of our analysis is to see which pages lost traffic after a particular date–compared to the period before that date–and which gained traffic after that date.
The example date chosen below corresponds to the exact midpoint of our start and end variables used above to gather the data, so that the data both before and after the date is similarly sized.
We begin the analysis by grouping each URL together by their path and adding up the newUsers for each URL. We do this with the built-in pandas method: .groupby(), which takes a column name as an input and groups together each unique value in that column.
The .sum() method then takes the sum of every other column in the data frame within each group.
For more information on these methods please see the Pandas documentation for groupby.
For those who might be familiar with SQL, this is analogous to a GROUP BY clause with a SUM in the select clause
# Change this depending on your needs
MIDPOINT_DATE = “2017-12-15”
before = data[data[‘ga:date’] < pd.to_datetime(MIDPOINT_DATE)]
after = data[data[‘ga:date’] >= pd.to_datetime(MIDPOINT_DATE)]
# Traffic totals before Shopify switch
totals_before = before[[“ga:landingPagePath”, “ga:newUsers”]]\
                .groupby(“ga:landingPagePath”).sum()
totals_before = totals_before.reset_index()\
                .sort_values(“ga:newUsers”, ascending=False)
# Traffic totals after Shopify switch
totals_after = after[[“ga:landingPagePath”, “ga:newUsers”]]\
               .groupby(“ga:landingPagePath”).sum()
totals_after = totals_after.reset_index()\
               .sort_values(“ga:newUsers”, ascending=False)
You can check the totals before and after with this code and double check with the Google Analytics numbers.
print(“Traffic Totals Before: “)
print(“Row count: “, len(totals_before))
print(“Traffic Totals After: “)
print(“Row count: “, len(totals_after))
Next up we merge the two data frames, so that we have a single column corresponding to the URL, and two columns corresponding to the totals before and after the date.
We have different options when merging as illustrated above. Here, we use an “outer” merge, because even if a URL didn’t show up in the “before” period, we still want it to be a part of this merged dataframe. We’ll fill in the blanks with zeros after the merge.
# Comparing pages from before and after the switch
change = totals_after.merge(totals_before, 
                            left_on=”ga:landingPagePath”, 
                            right_on=”ga:landingPagePath”, 
                            suffixes=[“_after”, “_before”], 
                            how=”outer”)
change.fillna(0, inplace=True)
Difference and percentage change
Pandas dataframes make simple calculations on whole columns easy. We can take the difference of two columns and divide two columns and it will perform that operation on every row for us. We will take the difference of the two totals columns, and divide by the “before” column to get the percent change before and after out midpoint date.
Using this percent_change column we can then filter our dataframe to get the winners, the losers and those URLs with no change.
change[‘difference’] = change[‘ga:newUsers_after’] – change[‘ga:newUsers_before’]
change[‘percent_change’] = change[‘difference’] / change[‘ga:newUsers_before’]
winners = change[change[‘percent_change’] > 0]
losers = change[change[‘percent_change’] < 0]
no_change = change[change[‘percent_change’] == 0]
Sanity check
Finally, we do a quick sanity check to make sure that all the traffic from the original data frame is still accounted for after all of our analysis. To do this, we simply take the sum of all traffic for both the original data frame and the two columns of our change dataframe.
# Checking that the total traffic adds up
data[‘ga:newUsers’].sum() == change[[‘ga:newUsers_after’, ‘ga:newUsers_before’]].sum().sum()
It should be True.
Results
Sorting by the difference in our losers data frame, and taking the .head(10), we can see the top 10 losers in our analysis. In other words, these pages lost the most total traffic between the two periods before and after the midpoint date.
losers.sort_values(“difference”).head(10)
You can do the same to review the winners and try to learn from them.
winners.sort_values(“difference”, ascending=False).head(10)
You can export the losing pages to a CSV or Excel using this.
losers.to_csv(“./losing-pages.csv”)
This seems like a lot of work to analyze just one site–and it is!
The magic happens when you reuse this code on new clients and simply need to replace the placeholder variables at the top of the script.
In part two, we will make the output more useful by grouping the losing (and winning) pages by their types to get the chart I included above.
The post Using Python to recover SEO site traffic (Part one) appeared first on Search Engine Watch.
from Digtal Marketing News https://searchenginewatch.com/2019/02/06/using-python-to-recover-seo-site-traffic-part-one/
0 notes
bambiguertinus · 6 years ago
Text
Using Python to recover SEO site traffic (Part one)
Helping a client recover from a bad redesign or site migration is probably one of the most critical jobs you can face as an SEO.
The traditional approach of conducting a full forensic SEO audit works well most of the time, but what if there was a way to speed things up? You could potentially save your client a lot of money in opportunity cost.
Last November, I spoke at TechSEO Boost and presented a technique my team and I regularly use to analyze traffic drops. It allows us to pinpoint this painful problem quickly and with surgical precision. As far as I know, there are no tools that currently implement this technique. I coded this solution using Python.
This is the first part of a three-part series. In part two, we will manually group the pages using regular expressions and in part three we will group them automatically using machine learning techniques. Let’s walk over part one and have some fun!
Winners vs losers
Last June we signed up a client that moved from Ecommerce V3 to Shopify and the SEO traffic took a big hit. The owner set up 301 redirects between the old and new sites but made a number of unwise changes like merging a large number of categories and rewriting titles during the move.
When traffic drops, some parts of the site underperform while others don’t. I like to isolate them in order to 1) focus all efforts on the underperforming parts, and 2) learn from the parts that are doing well.
I call this analysis the “Winners vs Losers” analysis. Here, winners are the parts that do well, and losers the ones that do badly.
A visualization of the analysis looks like the chart above. I was able to narrow down the issue to the category pages (Collection pages) and found that the main issue was caused by the site owner merging and eliminating too many categories during the move.
Let’s walk over the steps to put this kind of analysis together in Python.
You can reference my carefully documented Google Colab notebook here.
Getting the data
We want to programmatically compare two separate time frames in Google Analytics (before and after the traffic drop), and we’re going to use the Google Analytics API to do it.
Google Analytics Query Explorer provides the simplest approach to do this in Python.
Head on over to the Google Analytics Query Explorer
Click on the button at the top that says “Click here to Authorize” and follow the steps provided.
Use the dropdown menu to select the website you want to get data from.
Fill in the “metrics” parameter with “ga:newUsers” in order to track new visits.
Complete the “dimensions” parameter with “ga:landingPagePath” in order to get the page URLs.
Fill in the “segment” parameter with “gaid::-5” in order to track organic search visits.
Hit “Run Query” and let it run
Scroll down to the bottom of the page and look for the text box that says “API Query URI.”
Check the box underneath it that says “Include current access_token in the Query URI (will expire in ~60 minutes).”
At the end of the URL in the text box you should now see access_token=string-of-text-here. You will use this string of text in the code snippet below as  the variable called token (make sure to paste it inside the quotes)
Now, scroll back up to where we built the query, and look for the parameter that was filled in for you called “ids.” You will use this in the code snippet below as the variable called “gaid.” Again, it should go inside the quotes.
Run the cell once you’ve filled in the gaid and token variables to instantiate them, and we’re good to go!
First, let’s define placeholder variables to pass to the API
metrics = “,”.join([“ga:users”,”ga:newUsers”])
dimensions = “,”.join([“ga:landingPagePath”, “ga:date”])
segment = “gaid::-5”
# Required, please fill in with your own GA information example: ga:23322342
gaid = “ga:23322342”
# Example: string-of-text-here from step 8.2
token = “”
# Example https://www.example.com or http://example.org
base_site_url = “”
# You can change the start and end dates as you like
start = “2017-06-01”
end = “2018-06-30”
The first function combines the placeholder variables we filled in above with an API URL to get Google Analytics data. We make additional API requests and merge them in case the results exceed the 10,000 limit.
def GAData(gaid, start, end, metrics, dimensions, 
           segment, token, max_results=10000):
  “””Creates a generator that yields GA API data 
     in chunks of size `max_results`”””
  #build uri w/ params
  api_uri = “https://www.googleapis.com/analytics/v3/data/ga?ids={gaid}&”\
             “start-date={start}&end-date={end}&metrics={metrics}&”\
             “dimensions={dimensions}&segment={segment}&access_token={token}&”\
             “max-results={max_results}”
  # insert uri params
  api_uri = api_uri.format(
      gaid=gaid,
      start=start,
      end=end,
      metrics=metrics,
      dimensions=dimensions,
      segment=segment,
      token=token,
      max_results=max_results
  )
  # Using yield to make a generator in an
  # attempt to be memory efficient, since data is downloaded in chunks
  r = requests.get(api_uri)
  data = r.json()
  yield data
  if data.get(“nextLink”, None):
    while data.get(“nextLink”):
      new_uri = data.get(“nextLink”)
      new_uri += “&access_token={token}”.format(token=token)
      r = requests.get(new_uri)
      data = r.json()
      yield data
In the second function, we load the Google Analytics Query Explorer API response into a pandas DataFrame to simplify our analysis.
import pandas as pd
def to_df(gadata):
  “””Takes in a generator from GAData() 
     creates a dataframe from the rows”””
  df = None
  for data in gadata:
    if df is None:
      df = pd.DataFrame(
          data[‘rows’], 
          columns=[x[‘name’] for x in data[‘columnHeaders’]]
      )
    else:
      newdf = pd.DataFrame(
          data[‘rows’], 
          columns=[x[‘name’] for x in data[‘columnHeaders’]]
      )
      df = df.append(newdf)
    print(“Gathered {} rows”.format(len(df)))
  return df
Now, we can call the functions to load the Google Analytics data.
data = GAData(gaid=gaid, metrics=metrics, start=start, 
                end=end, dimensions=dimensions, segment=segment, 
                token=token)
data = to_df(data)
Analyzing the data
Let’s start by just getting a look at the data. We’ll use the .head() method of DataFrames to take a look at the first few rows. Think of this as glancing at only the top few rows of an Excel spreadsheet.
data.head(5)
This displays the first five rows of the data frame.
Most of the data is not in the right format for proper analysis, so let’s perform some data transformations.
First, let’s convert the date to a datetime object and the metrics to numeric values.
data[‘ga:date’] = pd.to_datetime(data[‘ga:date’])
data[‘ga:users’] = pd.to_numeric(data[‘ga:users’])
data[‘ga:newUsers’] = pd.to_numeric(data[‘ga:newUsers’])
Next, we will need the landing page URL, which are relative and include URL parameters in two additional formats: 1) as absolute urls, and 2) as relative paths (without the URL parameters).
from urllib.parse import urlparse, urljoin
data[‘path’] = data[‘ga:landingPagePath’].apply(lambda x: urlparse(x).path)
data[‘url’] = urljoin(base_site_url, data[‘path’])
Now the fun part begins.
The goal of our analysis is to see which pages lost traffic after a particular date–compared to the period before that date–and which gained traffic after that date.
The example date chosen below corresponds to the exact midpoint of our start and end variables used above to gather the data, so that the data both before and after the date is similarly sized.
We begin the analysis by grouping each URL together by their path and adding up the newUsers for each URL. We do this with the built-in pandas method: .groupby(), which takes a column name as an input and groups together each unique value in that column.
The .sum() method then takes the sum of every other column in the data frame within each group.
For more information on these methods please see the Pandas documentation for groupby.
For those who might be familiar with SQL, this is analogous to a GROUP BY clause with a SUM in the select clause
# Change this depending on your needs
MIDPOINT_DATE = “2017-12-15”
before = data[data[‘ga:date’] < pd.to_datetime(MIDPOINT_DATE)]
after = data[data[‘ga:date’] >= pd.to_datetime(MIDPOINT_DATE)]
# Traffic totals before Shopify switch
totals_before = before[[“ga:landingPagePath”, “ga:newUsers”]]\
                .groupby(“ga:landingPagePath”).sum()
totals_before = totals_before.reset_index()\
                .sort_values(“ga:newUsers”, ascending=False)
# Traffic totals after Shopify switch
totals_after = after[[“ga:landingPagePath”, “ga:newUsers”]]\
               .groupby(“ga:landingPagePath”).sum()
totals_after = totals_after.reset_index()\
               .sort_values(“ga:newUsers”, ascending=False)
You can check the totals before and after with this code and double check with the Google Analytics numbers.
print(“Traffic Totals Before: “)
print(“Row count: “, len(totals_before))
print(“Traffic Totals After: “)
print(“Row count: “, len(totals_after))
Next up we merge the two data frames, so that we have a single column corresponding to the URL, and two columns corresponding to the totals before and after the date.
We have different options when merging as illustrated above. Here, we use an “outer” merge, because even if a URL didn’t show up in the “before” period, we still want it to be a part of this merged dataframe. We’ll fill in the blanks with zeros after the merge.
# Comparing pages from before and after the switch
change = totals_after.merge(totals_before, 
                            left_on=”ga:landingPagePath”, 
                            right_on=”ga:landingPagePath”, 
                            suffixes=[“_after”, “_before”], 
                            how=”outer”)
change.fillna(0, inplace=True)
Difference and percentage change
Pandas dataframes make simple calculations on whole columns easy. We can take the difference of two columns and divide two columns and it will perform that operation on every row for us. We will take the difference of the two totals columns, and divide by the “before” column to get the percent change before and after out midpoint date.
Using this percent_change column we can then filter our dataframe to get the winners, the losers and those URLs with no change.
change[‘difference’] = change[‘ga:newUsers_after’] – change[‘ga:newUsers_before’]
change[‘percent_change’] = change[‘difference’] / change[‘ga:newUsers_before’]
winners = change[change[‘percent_change’] > 0]
losers = change[change[‘percent_change’] < 0]
no_change = change[change[‘percent_change’] == 0]
Sanity check
Finally, we do a quick sanity check to make sure that all the traffic from the original data frame is still accounted for after all of our analysis. To do this, we simply take the sum of all traffic for both the original data frame and the two columns of our change dataframe.
# Checking that the total traffic adds up
data[‘ga:newUsers’].sum() == change[[‘ga:newUsers_after’, ‘ga:newUsers_before’]].sum().sum()
It should be True.
Results
Sorting by the difference in our losers data frame, and taking the .head(10), we can see the top 10 losers in our analysis. In other words, these pages lost the most total traffic between the two periods before and after the midpoint date.
losers.sort_values(“difference”).head(10)
You can do the same to review the winners and try to learn from them.
winners.sort_values(“difference”, ascending=False).head(10)
You can export the losing pages to a CSV or Excel using this.
losers.to_csv(“./losing-pages.csv”)
This seems like a lot of work to analyze just one site–and it is!
The magic happens when you reuse this code on new clients and simply need to replace the placeholder variables at the top of the script.
In part two, we will make the output more useful by grouping the losing (and winning) pages by their types to get the chart I included above.
The post Using Python to recover SEO site traffic (Part one) appeared first on Search Engine Watch.
from Digtal Marketing News https://searchenginewatch.com/2019/02/06/using-python-to-recover-seo-site-traffic-part-one/
0 notes
oscarkruegerus · 6 years ago
Text
Using Python to recover SEO site traffic (Part one)
Helping a client recover from a bad redesign or site migration is probably one of the most critical jobs you can face as an SEO.
The traditional approach of conducting a full forensic SEO audit works well most of the time, but what if there was a way to speed things up? You could potentially save your client a lot of money in opportunity cost.
Last November, I spoke at TechSEO Boost and presented a technique my team and I regularly use to analyze traffic drops. It allows us to pinpoint this painful problem quickly and with surgical precision. As far as I know, there are no tools that currently implement this technique. I coded this solution using Python.
This is the first part of a three-part series. In part two, we will manually group the pages using regular expressions and in part three we will group them automatically using machine learning techniques. Let’s walk over part one and have some fun!
Winners vs losers
Last June we signed up a client that moved from Ecommerce V3 to Shopify and the SEO traffic took a big hit. The owner set up 301 redirects between the old and new sites but made a number of unwise changes like merging a large number of categories and rewriting titles during the move.
When traffic drops, some parts of the site underperform while others don’t. I like to isolate them in order to 1) focus all efforts on the underperforming parts, and 2) learn from the parts that are doing well.
I call this analysis the “Winners vs Losers” analysis. Here, winners are the parts that do well, and losers the ones that do badly.
A visualization of the analysis looks like the chart above. I was able to narrow down the issue to the category pages (Collection pages) and found that the main issue was caused by the site owner merging and eliminating too many categories during the move.
Let’s walk over the steps to put this kind of analysis together in Python.
You can reference my carefully documented Google Colab notebook here.
Getting the data
We want to programmatically compare two separate time frames in Google Analytics (before and after the traffic drop), and we’re going to use the Google Analytics API to do it.
Google Analytics Query Explorer provides the simplest approach to do this in Python.
Head on over to the Google Analytics Query Explorer
Click on the button at the top that says “Click here to Authorize” and follow the steps provided.
Use the dropdown menu to select the website you want to get data from.
Fill in the “metrics” parameter with “ga:newUsers” in order to track new visits.
Complete the “dimensions” parameter with “ga:landingPagePath” in order to get the page URLs.
Fill in the “segment” parameter with “gaid::-5” in order to track organic search visits.
Hit “Run Query” and let it run
Scroll down to the bottom of the page and look for the text box that says “API Query URI.”
Check the box underneath it that says “Include current access_token in the Query URI (will expire in ~60 minutes).”
At the end of the URL in the text box you should now see access_token=string-of-text-here. You will use this string of text in the code snippet below as  the variable called token (make sure to paste it inside the quotes)
Now, scroll back up to where we built the query, and look for the parameter that was filled in for you called “ids.” You will use this in the code snippet below as the variable called “gaid.” Again, it should go inside the quotes.
Run the cell once you’ve filled in the gaid and token variables to instantiate them, and we’re good to go!
First, let’s define placeholder variables to pass to the API
metrics = “,”.join([“ga:users”,”ga:newUsers”])
dimensions = “,”.join([“ga:landingPagePath”, “ga:date”])
segment = “gaid::-5”
# Required, please fill in with your own GA information example: ga:23322342
gaid = “ga:23322342”
# Example: string-of-text-here from step 8.2
token = “”
# Example https://www.example.com or http://example.org
base_site_url = “”
# You can change the start and end dates as you like
start = “2017-06-01”
end = “2018-06-30”
The first function combines the placeholder variables we filled in above with an API URL to get Google Analytics data. We make additional API requests and merge them in case the results exceed the 10,000 limit.
def GAData(gaid, start, end, metrics, dimensions, 
           segment, token, max_results=10000):
  “””Creates a generator that yields GA API data 
     in chunks of size `max_results`”””
  #build uri w/ params
  api_uri = “https://www.googleapis.com/analytics/v3/data/ga?ids={gaid}&”\
             “start-date={start}&end-date={end}&metrics={metrics}&”\
             “dimensions={dimensions}&segment={segment}&access_token={token}&”\
             “max-results={max_results}”
  # insert uri params
  api_uri = api_uri.format(
      gaid=gaid,
      start=start,
      end=end,
      metrics=metrics,
      dimensions=dimensions,
      segment=segment,
      token=token,
      max_results=max_results
  )
  # Using yield to make a generator in an
  # attempt to be memory efficient, since data is downloaded in chunks
  r = requests.get(api_uri)
  data = r.json()
  yield data
  if data.get(“nextLink”, None):
    while data.get(“nextLink”):
      new_uri = data.get(“nextLink”)
      new_uri += “&access_token={token}”.format(token=token)
      r = requests.get(new_uri)
      data = r.json()
      yield data
In the second function, we load the Google Analytics Query Explorer API response into a pandas DataFrame to simplify our analysis.
import pandas as pd
def to_df(gadata):
  “””Takes in a generator from GAData() 
     creates a dataframe from the rows”””
  df = None
  for data in gadata:
    if df is None:
      df = pd.DataFrame(
          data[‘rows’], 
          columns=[x[‘name’] for x in data[‘columnHeaders’]]
      )
    else:
      newdf = pd.DataFrame(
          data[‘rows’], 
          columns=[x[‘name’] for x in data[‘columnHeaders’]]
      )
      df = df.append(newdf)
    print(“Gathered {} rows”.format(len(df)))
  return df
Now, we can call the functions to load the Google Analytics data.
data = GAData(gaid=gaid, metrics=metrics, start=start, 
                end=end, dimensions=dimensions, segment=segment, 
                token=token)
data = to_df(data)
Analyzing the data
Let’s start by just getting a look at the data. We’ll use the .head() method of DataFrames to take a look at the first few rows. Think of this as glancing at only the top few rows of an Excel spreadsheet.
data.head(5)
This displays the first five rows of the data frame.
Most of the data is not in the right format for proper analysis, so let’s perform some data transformations.
First, let’s convert the date to a datetime object and the metrics to numeric values.
data[‘ga:date’] = pd.to_datetime(data[‘ga:date’])
data[‘ga:users’] = pd.to_numeric(data[‘ga:users’])
data[‘ga:newUsers’] = pd.to_numeric(data[‘ga:newUsers’])
Next, we will need the landing page URL, which are relative and include URL parameters in two additional formats: 1) as absolute urls, and 2) as relative paths (without the URL parameters).
from urllib.parse import urlparse, urljoin
data[‘path’] = data[‘ga:landingPagePath’].apply(lambda x: urlparse(x).path)
data[‘url’] = urljoin(base_site_url, data[‘path’])
Now the fun part begins.
The goal of our analysis is to see which pages lost traffic after a particular date–compared to the period before that date–and which gained traffic after that date.
The example date chosen below corresponds to the exact midpoint of our start and end variables used above to gather the data, so that the data both before and after the date is similarly sized.
We begin the analysis by grouping each URL together by their path and adding up the newUsers for each URL. We do this with the built-in pandas method: .groupby(), which takes a column name as an input and groups together each unique value in that column.
The .sum() method then takes the sum of every other column in the data frame within each group.
For more information on these methods please see the Pandas documentation for groupby.
For those who might be familiar with SQL, this is analogous to a GROUP BY clause with a SUM in the select clause
# Change this depending on your needs
MIDPOINT_DATE = "2017-12-15"
before = data[data['ga:date'] < pd.to_datetime(MIDPOINT_DATE)]
after = data[data['ga:date'] >= pd.to_datetime(MIDPOINT_DATE)]
# Traffic totals before Shopify switch
totals_before = before[["ga:landingPagePath", "ga:newUsers"]]\
                .groupby("ga:landingPagePath").sum()
totals_before = totals_before.reset_index()\
                .sort_values("ga:newUsers", ascending=False)
# Traffic totals after Shopify switch
totals_after = after[["ga:landingPagePath", "ga:newUsers"]]\
               .groupby("ga:landingPagePath").sum()
totals_after = totals_after.reset_index()\
               .sort_values("ga:newUsers", ascending=False)
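As an optional aside (not in the original article), you could run the same aggregation on the cleaned path column created earlier, which collapses URL-parameter variants of the same page into a single row; totals_before_by_path below is just an illustrative name:
# Optional: aggregate on the cleaned 'path' column so URL-parameter variants
# of the same page are counted together.
totals_before_by_path = before[['path', 'ga:newUsers']]\
                        .groupby('path').sum()\
                        .reset_index()\
                        .sort_values('ga:newUsers', ascending=False)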
You can check the row counts and totals before and after with this code and double-check them against the Google Analytics numbers.
print("Traffic Totals Before: ")
print("Row count: ", len(totals_before))
print("New users: ", totals_before['ga:newUsers'].sum())
print("Traffic Totals After: ")
print("Row count: ", len(totals_after))
print("New users: ", totals_after['ga:newUsers'].sum())
Next up we merge the two data frames, so that we have a single column corresponding to the URL, and two columns corresponding to the totals before and after the date.
We have different options when merging (inner, left, right, or outer). Here, we use an "outer" merge, because even if a URL didn't show up in the "before" period, we still want it to be part of this merged dataframe. We'll fill in the blanks with zeros after the merge. The short example below illustrates the difference.
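Here is a minimal, self-contained sketch, with made-up URLs and numbers purely for illustration, of why the outer merge matters:
# Toy example (not part of the analysis): '/a' only has traffic after the
# switch and '/c' only before; an outer merge keeps both rows, while an
# inner merge would silently drop them.
after_demo = pd.DataFrame({'url': ['/a', '/b'], 'newUsers_after': [10, 5]})
before_demo = pd.DataFrame({'url': ['/b', '/c'], 'newUsers_before': [7, 3]})
print(after_demo.merge(before_demo, on='url', how='outer'))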
# Comparing pages from before and after the switch
change = totals_after.merge(totals_before,
                            left_on="ga:landingPagePath",
                            right_on="ga:landingPagePath",
                            suffixes=["_after", "_before"],
                            how="outer")
change.fillna(0, inplace=True)
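A quick optional check, not in the original article, that the outer merge produced exactly one row per unique landing page:
# Every landing page from either period should appear exactly once after the merge
assert len(change) == change['ga:landingPagePath'].nunique()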
Difference and percentage change
Pandas dataframes make simple calculations on whole columns easy. We can take the difference of two columns, or divide one column by another, and pandas performs the operation on every row for us. We take the difference of the two totals columns and divide by the "before" column to get the percent change before and after our midpoint date.
Using this percent_change column we can then filter our dataframe to get the winners, the losers and those URLs with no change.
change['difference'] = change['ga:newUsers_after'] - change['ga:newUsers_before']
change['percent_change'] = change['difference'] / change['ga:newUsers_before']
winners = change[change['percent_change'] > 0]
losers = change[change['percent_change'] < 0]
no_change = change[change['percent_change'] == 0]
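One thing to watch: pages with zero traffic in the "before" period divide by zero, so pandas gives them an infinite percent_change. They still land in winners, which is usually what you want, but if you'd like to look at those brand-new pages separately, here is one way to do it (new_pages is just an illustrative name, and numpy is assumed to be available):
import numpy as np
# Pages with zero 'before' traffic get an infinite percent_change from the
# division above; pull them out separately if that is useful to you.
new_pages = change[np.isinf(change['percent_change'])]
print(len(new_pages), "pages had no 'before' traffic in this dataset")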
Sanity check
Finally, we do a quick sanity check to make sure that all the traffic from the original data frame is still accounted for after all of our analysis. To do this, we simply take the sum of all traffic for both the original data frame and the two columns of our change dataframe.
# Checking that the total traffic adds up
data['ga:newUsers'].sum() == change[['ga:newUsers_after', 'ga:newUsers_before']].sum().sum()
It should be True.
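If it comes back False, a quick optional way to see how far apart the two sides are:
# Print both sides of the comparison to spot any missing traffic
print("Original total:", data['ga:newUsers'].sum())
print("Merged total:", change[['ga:newUsers_after', 'ga:newUsers_before']].sum().sum())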
Results
Sorting by the difference in our losers data frame, and taking the .head(10), we can see the top 10 losers in our analysis. In other words, these pages lost the most total traffic between the two periods before and after the midpoint date.
losers.sort_values("difference").head(10)
You can do the same to review the winners and try to learn from them.
winners.sort_values("difference", ascending=False).head(10)
You can export the losing pages to a CSV file (or an Excel workbook, shown below) using this.
losers.to_csv("./losing-pages.csv")
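If you'd rather hand over an Excel workbook, pandas can write one directly; this assumes an Excel writer backend such as openpyxl is installed in your environment:
# Requires an Excel engine, e.g. pip install openpyxl
losers.to_excel("./losing-pages.xlsx", index=False)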
This seems like a lot of work to analyze just one site, and it is!
The magic happens when you reuse this code on new clients and simply need to replace the placeholder variables at the top of the script.
In part two, we will make the output more useful by grouping the losing (and winning) pages by their types to get the chart I included above.
The post Using Python to recover SEO site traffic (Part one) appeared first on Search Engine Watch.
from Digital Marketing News https://searchenginewatch.com/2019/02/06/using-python-to-recover-seo-site-traffic-part-one/
0 notes