#javascript sum table row values
Explore tagged Tumblr posts
Text
How to calculate sum of column in jquery
How to calculate sum of column in jquery
In this article, we learn about how to calculate sum of column in jquery, or you can also say how to calculate the value when giving input to the column. In this article we will jquery version 3.X and bootstrap version 4 to give some feel to our form. You will learn about each function() , parse column data in this article. Below is the basic html code which includes cdn in their…
View On WordPress
#"calculate sum total of column in jquery#calculate sum total of column in jquery how to calculate total in jquery#how to calculate sum of column values in javascript#how to calculate total in jquery#how to dynamically add row and calculate sum using jquery#javascript sum table row values#jquery datatable sum of particular column#update column value in html table using jquery#using jquery to perform calculations in a table
0 notes
Text
Databases: how they work, and a brief history
My twitter-friend Simon had a simple question that contained much complexity: how do databases work?
Ok, so databases really confuse me, like how do databases even work?
— Simon Legg (@simonleggsays) November 18, 2019
I don't have a job at the moment, and I really love databases and also teaching things to web developers, so this was a perfect storm for me:
To what level of detail would you like an answer? I love databases.
— Laurie Voss (@seldo) November 18, 2019
The result was an absurdly long thread of 70+ tweets, in which I expounded on the workings and history of databases as used by modern web developers, and Simon chimed in on each tweet with further questions and requests for clarification. The result of this collaboration was a super fun tiny explanation of databases which many people said they liked, so here it is, lightly edited for clarity.
What is a database?
Let's start at the very most basic thing, the words we're using: a "database" literally just means "a structured collection of data". Almost anything meets this definition – an object in memory, an XML file, a list in HTML. It's super broad, so we call some radically different things "databases".
The thing people use all the time is, formally, a Database Management System, abbreviated to DBMS. This is a piece of software that handles access to the pile of data. Technically one DBMS can manage multiple databases (MySQL and postgres both do this) but often a DBMS will have just one database in it.
Because it's so frequent that the DBMS has one DB in it we often call a DBMS a "database". So part of the confusion around databases for people new to them is because we call so many things the same word! But it doesn't really matter, you can call an DBMS a "database" and everyone will know what you mean. MySQL, Redis, Postgres, RedShift, Oracle etc. are all DBMS.
So now we have a mental model of a "database", really a DBMS: it is a piece of software that manages access to a pile of structured data for you. DBMSes are often written in C or C++, but it can be any programming language; there are databases written in Erlang and JavaScript. One of the key differences between DBMSes is how they structure the data.
Relational databases
Relational databases, also called RDBMS, model data as a table, like you'd see in a spreadsheet. On disk this can be as simple as comma-separated values: one row per line, commas between columns, e.g. a classic example is a table of fruits:
apple,10,5.00 orange,5,6.50
The DBMS knows the first column is the name, the second is the number of fruits, the third is the price. Sometimes it will store that information in a different database! Sometimes the metadata about what the columns are will be in the database file itself. Because it knows about the columns, it can handle niceties for you: for example, the first column is a string, the second is an integer, the third is dollar values. It can use that to make sure it returns those columns to you correctly formatted, and it can also store numbers more efficiently than just strings of digits.
In reality a modern database is doing a whole bunch of far more clever optimizations than just comma separated values but it's a mental model of what's going on that works fine. The data all lives on disk, often as one big file, and the DBMS caches parts of it in memory for speed. Sometimes it has different files for the data and the metadata, or for indexes that make it easier to find things quickly, but we can safely ignore those details.
RDBMS are older, so they date from a time when memory was really expensive, so they usually optimize for keeping most things on disk and only put some stuff in memory. But they don't have to: some RDBMS keep everything in memory and never write to disk. That makes them much faster!
Is it still a database if all the structured data stays in memory? Sure. It's a pile of structured data. Nothing in that definition says a disk needs to be involved.
So what does the "relational" part of RDBMS mean? RDBMS have multiple tables of data, and they can relate different tables to each other. For instance, imagine a new table called "Farmers":
IDName 1bob 2susan
and we modify the Fruits table:
Farmer IDFruitQuantityPrice 1apple105.00 1orange56.50 2apple206.00 2orange14.75
.dbTable { border: 1px solid black; } .dbTable thead td { background-color: #eee; } .dbTable td { padding: 0.3em; }
The Farmers table gives each farmer a name and an ID. The Fruits table now has a column that gives the Farmer ID, so you can see which farmer has which fruit at which price.
Why's that helpful? Two reasons: space and time. Space because it reduces data duplication. Remember, these were invented when disks were expensive and slow! Storing the data this way lets you only list "susan" once no matter how many fruits she has. If she had a hundred kinds of fruit you'd be saving quite a lot of storage by not repeating her name over and over. The time reason comes in if you want to change Susan's name. If you repeated her name hundreds of times you would have to do a write to disk for each one (and writes were very slow at the time this was all designed). That would take a long time, plus there's a chance you could miss one somewhere and suddenly Susan would have two names and things would be confusing.
Relational databases make it easy to do certain kinds of queries. For instance, it's very efficient to find out how many fruits there are in total: you just add up all the numbers in the Quantity column in Fruits, and you never need to look at Farmers at all. It's efficient and because the DBMS knows where the data is you can say "give me the sum of the quantity colum" pretty simply in SQL, something like SELECT SUM(Quantity) FROM Fruits. The DBMS will do all the work.
NoSQL databases
So now let's look at the NoSQL databases. These were a much more recent invention, and the economics of computer hardware had changed: memory was a lot cheaper, disk space was absurdly cheap, processors were a lot faster, and programmers were very expensive. The designers of newer databases could make different trade-offs than the designers of RDBMS.
The first difference of NoSQL databases is that they mostly don't store things on disk, or do so only once in a while as a backup. This can be dangerous – if you lose power you can lose all your data – but often a backup from a few minutes or seconds ago is fine and the speed of memory is worth it. A database like Redis writes everything to disk every 200ms or so, which is hardly any time at all, while doing all the real work in memory.
A lot of the perceived performance advantages of "noSQL" databases is just because they keep everything in memory and memory is very fast and disks, even modern solid-state drives, are agonizingly slow by comparison. It's nothing to do with whether the database is relational or not-relational, and nothing at all to do with SQL.
But the other thing NoSQL database designers did was they abandoned the "relational" part of databases. Instead of the model of tables, they tended to model data as objects with keys. A good mental model of this is just JSON:
[ {"name":"bob"} {"name":"susan","age":55} ]
Again, just as a modern RDBMS is not really writing CSV files to disk but is doing wildly optimized stuff, a NoSQL database is not storing everything as a single giant JSON array in memory or disk, but you can mentally model it that way and you won't go far wrong. If I want the record for Bob I ask for ID 0, Susan is ID 1, etc..
One advantage here is that I don't need to plan in advance what I put in each record, I can just throw anything in there. It can be just a name, or a name and an age, or a gigantic object. With a relational DB you have to plan out columns in advance, and changing them later can be tricky and time-consuming.
Another advantage is that if I want to know everything about a farmer, it's all going to be there in one record: their name, their fruits, the prices, everything. In a relational DB that would be more complicated, because you'd have to query the farmers and fruits tables at the same time, a process called "joining" the tables. The SQL "JOIN" keyword is one way to do this.
One disadvantage of storing records as objects like this, formally called an "object store", is that if I want to know how many fruits there are in total, that's easy in an RDBMS but harder here. To sum the quantity of fruits, I have to retrieve each record, find the key for fruits, find all the fruits, find the key for quantity, and add these to a variable. The DBMS for the object store may have an API to do this for me if I've been consistent and made all the objects I stored look the same. But I don't have to do that, so there's a chance the quantities are stored in different places in different objects, making it quite annoying to get right. You often have to write code to do it.
But sometimes that's okay! Sometimes your app doesn't need to relate things across multiple records, it just wants all the data about a single key as fast as possible. Relational databases are best for the former, object stores the best for the latter, but both types can answer both types of questions.
Some of the optimizations I mentioned both types of DBMS use are to allow them to answer the kinds of questions they're otherwise bad at. RDBMS have "object" columns these days that let you store object-type things without adding and removing columns. Object stores frequently have "indexes" that you can set up to be able to find all the keys in a particular place so you can sum up things like Quantity or search for a specific Fruit name fast.
So what's the difference between an "object store" and a "noSQL" database? The first is a formal name for anything that stores structured data as objects (not tables). The second is... well, basically a marketing term. Let's digress into some tech history!
The self-defeating triumph of MySQL
Back in 1995, when the web boomed out of nowhere and suddenly everybody needed a database, databases were mostly commercial software, and expensive. To the rescue came MySQL, invented 1995, and Postgres, invented 1996. They were free! This was a radical idea and everybody adopted them, partly because nobody had any money back then – the whole idea of making money from websites was new and un-tested, there was no such thing as a multi-million dollar seed round. It was free or nothing.
The primary difference between PostgreSQL and MySQL was that Postgres was very good and had lots of features but was very hard to install on Windows (then, as now, the overwhelmingly most common development platform for web devs). MySQL did almost nothing but came with a super-easy installer for Windows. The result was MySQL completely ate Postgres' lunch for years in terms of market share.
Lots of database folks will dispute my assertion that the Windows installer is why MySQL won, or that MySQL won at all. But MySQL absolutely won, and it was because of the installer. MySQL became so popular it became synonymous with "database". You started any new web app by installing MySQL. Web hosting plans came with a MySQL database for free by default, and often no other databases were even available on cheaper hosts, which further accelerated MySQL's rise: defaults are powerful.
The result was people using mySQL for every fucking thing, even for things it was really bad at. For instance, because web devs move fast and change things they had to add new columns to tables all the time, and as I mentioned RDBMS are bad at that. People used MySQL to store uploaded image files, gigantic blobs of binary data that have no place in a DBMS of any kind.
People also ran into a lot of problems with RDBMS and MySQL in particular being optimized for saving memory and storing everything on disk. It made huge databases really slow, and meanwhile memory had got a lot cheaper. Putting tons of data in memory had become practical.
The rise of in-memory databases
The first software to really make use of how cheap memory had become was Memcache, released in 2003. You could run your ordinary RDBMS queries and just throw the results of frequent queries into Memcache, which stored them in memory so they were way, WAY faster to retrieve the second time. It was a revolution in performance, and it was an easy optimization to throw into your existing, RDBMS-based application.
By 2009 somebody realized that if you're just throwing everything in a cache anyway, why even bother having an RDBMS in the first place? Enter MongoDB and Redis, both released in 2009. To contrast themselves with the dominant "MySQL" they called themselves "NoSQL".
What's the difference between an in-memory cache like Memcache and an in-memory database like Redis or MongoDB? The answer is: basically nothing. Redis and Memcache are fundamentally almost identical, Redis just has much better mechanisms for retrieving and accessing the data in memory. A cache is a kind of DB, Memcache is a DBMS, it's just not as easy to do complex things with it as Redis.
Part of the reason Mongo and Redis called themselves NoSQL is because, well, they didn't support SQL. Relational databases let you use SQL to ask questions about relations across tables. Object stores just look up objects by their key most of the time, so the expressiveness of SQL is overkill. You can just make an API call like get(1) to get the record you want.
But this is where marketing became a problem. The NoSQL stores (being in memory) were a lot faster than the relational DBMS (which still mostly used disk). So people got the idea that SQL was the problem, that SQL was why RDBMS were slow. The name "NoSQL" didn't help! It sounded like getting rid of SQL was the point, rather than a side effect. But what most people liked about the NoSQL databases was the performance, and that was just because memory is faster than disk!
Of course, some people genuinely do hate SQL, and not having to use SQL was attractive to them. But if you've built applications of reasonable complexity on both an RDBMS and an object store you'll know that complicated queries are complicated whether you're using SQL or not. I have a lot of love for SQL.
If putting everything in memory makes your database faster, why can't you build an RDBMS that stores everything in memory? You can, and they exist! VoltDB is one example. They're nice! Also, MySQL and Postgres have kind of caught up to the idea that machines have lots more RAM now, so you can configure them to keep things mostly in memory too, so their default performance is a lot better and their performance after being tuned by an expert can be phenomenal.
So anything that's not a relational database is technically a "NoSQL" database. Most NoSQL databases are object stores but that's really just kind of a historical accident.
How does my app talk to a database?
Now we understand how a database works: it's software, running on a machine, managing data for you. How does your app talk to the database over a network and get answers to queries? Are all databases just a single machine?
The answer is: every DBMS, whether relational or object store, is a piece of software that runs on machine(s) that hold the data. There's massive variation: some run on 1 machine, some on clusters of 5-10, some run across thousands of separate machines all at once.
The DBMS software does the management of the data, in memory or on disk, and it presents an API that can be accessed locally, and also more importantly over the network. Sometimes this is a web API like you're used to, literally making GET and POST calls over HTTP to the database. For other databases, especially the older ones, it's a custom protocol.
Either way, you run a piece of software in your app, usually called a Client. That client knows the protocol for talking to the database, whether it's HTTP or WhateverDBProtocol. You tell it where the database server is on the network, it sends queries over and gets responses. Sometimes the queries are literally strings of text, like "SELECT * FROM Fruits", sometimes they are JSON payloads describing records, and any number of other variations.
As a starting point, you can think of the client running on your machine talking over the network to a database running on another machine. Sometimes your app is on dozens of machines, and the database is a single IP address with thousands of machines pretending to be one machine. But it works pretty much the same either way.
The way you tell your client "where" the DB is is your connection credentials, often expressed as a string like "http://username:[email protected]:1234" or "mongodb://...". But this is just a convenient shorthand. All your client really needs to talk to a database is the DNS name (like mydb.com) or an IP address (like 205.195.134.39), plus a port (1234). This tells the network which machine to send the query to, and what "door" to knock on when it gets there.
A little about ports: machines listen on specific ports for things, so if you send something to port 80, the machine knows the query is for your web server, but if you send it to port 1234, it knows the query is for your database. Who picks 1234 (In the case of Postgres, it's literally 5432)? There's no rhyme or reason to it. The developers pick a number that's easy to remember between 1 and 65,535 (the highest port number available) and hope that no other popular piece of software is already using it.
Usually you'll also have a username and password to connect to the database, because otherwise anybody who found your machine could connect to your database and get all the data in it. Forgetting that this is true is a really common source of security breaches!
There are bad people on the internet who literally just try every single IP in the world and send data to the default port for common databases and try to connect without a username or password to see if they can. If it works, they take all the data and then ransom it off. Yikes! Always make sure your database has a password.
Of course, sometimes you don't talk to your database over a network. Sometimes your app and your database live on the same machine. This is common in desktop software but very rare in web apps. If you've ever heard of a "database driver", the "driver" is the equivalent of the "client", but for talking to a local database instead of over a network.
Replication and scaling
Remember I said some databases run on just 1 machine, and some run on thousands of machines? That's known as replication. If you have more than one copy of a piece of data, you have a "replica" of that data, hence the name.
Back in the old days hardware was expensive so it was unusual to have replicas of your data running at the same time. It was expensive. Instead you'd back up your data to tape or something, and if the database went down because the hardware wore out or something, then you'd buy new hardware and (hopefully) reinstall your DBMS and restore the data in a few hours.
Web apps radically changed people's demands of databases. Before web apps, most databases weren't being continuously queried by the public, just a few experts inside normal working hours, and they would wait patiently if the database broke. With a web app you can't have minutes of downtime, far less hours, so replication went from being a rare feature of expensive databases to pretty much table stakes for every database. The initial form of replication was a "hot spare".
If you ran a hot spare, you'd have your main DBMS machine, which handled all queries, and a replica DBMS machine that would copy every single change that happened on the primary to itself. Primary was called m****r and the replica s***e because the latter did whatever the former told it to do, and at the time nobody considered how horrifying that analogy was. These days we call those things "primary/secondary" or "primary/replica" or for more complicated arrangements things like "root/branch/leaf".
Sometimes, people would think having a hot spare meant they didn't need a backup. This is a huge mistake! Remember, the replica copies every change in the main database. So if you accidentally run a command that deletes all the data in your primary database, it will automatically delete all the data in the replica too. Replicas are not backups, as the bookmarking site Magnolia famously learned.
People soon realized having a whole replica machine sitting around doing nothing was a waste, so to be more efficient they changed where traffic went: all the writes would go to the primary, which would copy everything to the replicas, and all the reads would go to the replicas. This was great for scale!
Instead of having 1 machine worth of performance (and you could swap to the hot spare if it failed, and still have 1 machine of performance with no downtime) suddenly you had X machines of performance, where X could be dozens or even hundreds. Very helpful!
But primary/secondary replication of this kind has two drawbacks. First, if a write has arrived at the primary database but not yet replicated to all the secondary machines (which can take half a second if the machines are far apart or overloaded) then somebody reading from the replica can get an answer that's out of date. This is known as a "consistency" failure, and we'll talk about it more later.
The second flaw with primary/second replication is if the primary fails, suddenly you can no longer write to your database. To restore the ability to do writes, you have to take one of the replicas and "promote" it to primary, and change all the other replicas to point at this new primary box. It's time-consuming and notoriously error-prone.
So newer databases invented different ways of arranging the machines, formally called "network topology". If you think of the way machines connect to each other as a diagram, the topology is the shape of that diagram. Primary/secondary looks like a star. Root/branch/leaf looks like a tree. But you can have a ring structure, or a mesh structure, or lots of others. A mesh structure is a lot of fun and very popular, so let's talk about more about them.
Mesh replication databases
In a mesh structure, every machine is talking to every other machine and they all have some portion of the data. You can send a write to any machine and it will either store it, or figure out what machine should store it and send it to that machine. Likewise, you can query any machine in the mesh, and it will give you the answer if it has the data, or forward your request to a machine that does. There's no "primary" machine to fail. Neat!
Because each machine can get away with storing only some of the data and not all of it, a mesh database can store much, much more data than a single machine could store. If 1 machine could store X data, then N machines could theoretically store N*X data. You can almost scale infinitely that way! It's very cool.
Of course, if each record only existed on one machine, then if that machine failed you'd lose those records. So usually in a mesh network more than one machine will have a copy of any individual record. That means you can lose machines without losing data or experiencing downtime; there are other copies lying around. In some mesh databases can also add a new machine to the mesh and the others will notice it and "rebalance" data, increasing the capacity of the database without any downtime. Super cool.
So a mesh topology is a lot more complicated but more resilient, and you can scale it without having to take the database down (usually). This is very nice, but can go horribly wrong if, for instance, there's a network error and suddenly half the machines can't see the other half of the machines in the mesh. This is called a "network partition" and it's a super common failure in large networks. Usually a partition will last only a couple of seconds but that's more than enough to fuck up a database. We'll talk about network partitions shortly.
One important question about a mesh DB is: how do you connect to it? Your client needs to know an IP address to connect to a database. Does it need to know the IP addresses of every machine in the mesh? And what happens when you add and remove machines from the mesh? Sounds messy.
Different Mesh DBs do it differently, but usually you get a load balancer, another machine that accepts all the incoming connections and works out which machine in the mesh should get the question and hands it off. Of course, this means the load balancer can fail, hosing your DB. So usually you'll do some kind of DNS/IP trickery where there are a handful of load balancers all responding on the same domain name or IP address.
The end result is your client magically just needs to know only one name or IP, and that IP always responds because the load balancer always sends you to a working machine.
CAP theory
This brings us neatly to a computer science term often used to talk about databases which is Consistency, Availability, and Partition tolerance, aka CAP or "CAP theory". The basic rule of CAP theory is: you can't have all 3 of Consistency, Availability and Partition Tolerance at the same time. Not because we're not smart enough to build a database that good, but because doing so violates physics.
Consistency means, formally: every query gets the correct, most up-to-date answer (or an error response saying you can't have it).
Availability means: every query gets an answer (but it's not guaranteed to be the correct one).
Partition Tolerance means: if the network craps out, the database will continue to work.
You can already see how these conflict! If you're 100% Available it means by definition you'll never give an error response, so sometimes the data will be out of date, i.e. not Consistent. If your database is Partition Tolerant, on the other hand, it keeps working even if machine A can't talk to machine B, and machine A might have a more recent write than B, so machine B will give stale (i.e. not Consistent) responses to keep working.
So let's think about how CAP theorem applies across the topologies we already talked about.
A single DB on a single machine is definitely Consistent (there's only one copy of the data) and Partition Tolerant (there's no network inside of it to crap out) but not Available because the machine itself can fail, e.g. the hardware could literally break or power could go out.
A primary DB with several replicas is Available (if one replica fails you can ask another) and Partition Tolerant (the replicas will respond even if they're not receiving writes from the primary) but not Consistent (because as mentioned earlier, the replicas might not have every primary write yet).
A mesh DB is extremely Available (all the nodes always answer) and Partition Tolerant (just try to knock it over! It's delightfully robust!) but can be extremely inconsistent because two different machines on the mesh could get a write to the same record at the same time and fight about which one is "correct".
This is the big disadvantage to mesh DBs, which otherwise are wonderful. Sometimes it's impossible to know which of two simultaneous writes is the "winner". There's no single authority, and Very Very Complicated Algorithms are deployed trying to prevent fights breaking out between machines in the mesh about this, with highly variable levels of success and gigantic levels of pain when they inevitably fail. You can't get all three of CAP and Consistency is what mesh networks lose.
In all databases, CAP isn't a set of switches where you are or aren't Consistent, Available, or Partition Tolerant. It's more like a set of sliders. Sliding up the Partition Tolerance generally slides down Consistency, sliding down Availability will give you more Consistency, etc etc.. Every DBMS picks some combination of CAP and picking the right database is often a matter of choosing what CAP combination is appropriate for your application.
Other topologies
Some other terms you frequently hear in the world of databases are "partitions" (which are different from the network partitions of CAP theorem) and "shards". These are both additional topologies available to somebody designing a database. Let's talk about shards first.
Imagine a primary with multiple replicas, but instead of each replica having all the data, each replica has a slice (or shard) of the data. You can slice the data lots of ways. If the database was people, you could have 26 shards, one with all names starting with A, one with all the names starting with B, etc..
Sharding can be helpful if the data is too big to all fit on one disk at a time. This is less of a problem than it used to be because virtual machines these days can effectively have infinity-sized hard drives.
The disadvantage of sharding is it's less Available: if you lose a shard, you lose everybody who starts with that letter! (Of course, your shards can also have replicas...) Plus your software needs to know where all the shards are and which one to ask a question. It's fiddly. Many of the problems of sharded databases are solved by using mesh topologies instead.
Partitions are another way of splitting up a database, but instead of splitting it across many machines, it splits the database across many files in a single machine. This is an old pattern that was useful when you had really powerful hardware and really slow disks, because you could install multiple disks into a single machine and put different partitions on each one, speeding up your achingly slow, disk-based database. These days there's not a lot of reason to use partitions of this kind.
Fin
That concludes this impromptu Databases 101 seminar! I hope you enjoyed learning a little bit more about this fantastically fun and critically important genre of software. from Seldo.Com Feed https://ift.tt/32XwZth
1 note
·
View note
Text
Back End
CRUD Operations: CRUD means create, read, update and delete. Using CRUD we can create data in the database if we want, read the data in the database, update the data and delete the data if we want. CRUD is very important for full-stack projects. If we can create a storefront, blog posting page, todo list or social media clone, without CRUD then we will get stuck very quickly.
JWT: JWT, or JSON Web Token, is an open value used to share security information between a client and a server. Each JWT contains an encoded JSON object that contains a claims set. JWT claims are made using a cryptographic algorithm so that claims cannot be changed after the token has been issued.
Mongoose: Mongoose is a node js-based Object Data Modeling library for MongoDB. SQLAlchemy for SQL databases like an Object Relational Mapper. The problem that Mongoose aims to solve is that developers allow the application layer to apply a specific schema.
SQL and NoSQL databases: In SQL the data is in table form. Here every data is stored inside the row. And in NOSQL the data is in object form. Here all the data of a person is shown in object form.
Aggregation: Provides aggregation data records / documents and returns computed results in MongoDB. It collects values from different documents and groups them together. It then performs various operations on the grouped data such as sum, average, minimum, maximum, etc. to return a calculated result.
Express: Express js is a Node js web application server framework, designed to create single-page, multi-page and hybrid web applications.
Nodejs: Node.js is a JavaScript runtime. It is built on Chrome's V8 JavaScript engine. JavaScript is a popular programming language that runs on any web browser, including a good web browser. Node. js is an interpreter with some specific useful libraries for JavaScript that can be used separately in JS programming. Node. js is primarily used for non-blocking, event-driven servers, due to its single-threaded nature.
Entity: Entity can be anything like a place, class or object which has an independent existence in the real world.
Entity Type: Entity Type represents an entity that has the same characteristics.
Entity Set: Entity Set in the database represents a collection of entities having a particular entity type.
Index hunting: Index hunting is the process of increasing the collection of indexes which helps to improve the query performance as well as to make the database faster.
Fragmentation: Fragmentation controls logical data units. It is also known as a fragment which is stored on different sites of a distributed database system.
Data Dictionary: A data dictionary is a set of information that describes the contents and structure of a table and a database object. The function of the information stored in the data dictionary is to control, manipulate and access the relationships between the database elements.
Primary Key: The primary key is the column in the table where each row of data is individually marked. Each row of the table may have a Primary Key but two rows may not have the same Primary Key.
Composite Key: Composite Key is a form of the candidate key where a set of columns will uniquely identify every row in the table.
Unique key: A unique key is a primary key whose data in each row is individually marked with a null value difference, meaning that the unique key approves a value as a zero value.
Database trigger: A set of commands that is automatically executed when an event occurs, such as deleting a row, while inserting, updating, before inserting into a table is called a database trigger.
B-Tree: B-Tree represents a tree-shaped data structure for external memory that can read and write large blocks of data. It is commonly used in databases and file systems where all insertions, deletions, sorting, etc. are done at logarithmic times.
Normalization: Normalization is the process of extracting unnecessary data from a database by splitting the table in a well-defined manner to maintain data integrity.
De-normalization: De-normalization is the process of adding unnecessary data to a table to speed up complex queries and thus achieve better performance.
BCNF: BCNF is the normal Boyce Codd form. This is a higher version of 3Nf where there are no multiple overlapping candidate keys.
DML Compiler: The DML compiler translates the DML statement into a query language in a low-level instruction and the generated instruction is understood by the Query Evaluation Engine.
#programmer sajeeb#programmer#programming#coder#CRUD Operations#JWT#Mongoose#SQL and NoSQL databases#Aggregation#Express#Nodejs#Entity#Entity Type#Index hunting#Fragmentation#Data Dictionary#Primary Key#Composite Key#Unique key#Database trigger
0 notes
Text
SEO Analytics for Free - Combining Google Search with the Moz API
Posted by Purple-Toolz
I’m a self-funded start-up business owner. As such, I want to get as much as I can for free before convincing our finance director to spend our hard-earned bootstrapping funds. I’m also an analyst with a background in data and computer science, so a bit of a geek by any definition.
What I try to do, with my SEO analyst hat on, is hunt down great sources of free data and wrangle it into something insightful. Why? Because there’s no value in basing client advice on conjecture. It’s far better to combine quality data with good analysis and help our clients better understand what’s important for them to focus on.
In this article, I will tell you how to get started using a few free resources and illustrate how to pull together unique analytics that provide useful insights for your blog articles if you’re a writer, your agency if you’re an SEO, or your website if you’re a client or owner doing SEO yourself.
The scenario I’m going to use is that I want analyze some SEO attributes (e.g. backlinks, Page Authority etc.) and look at their effect on Google ranking. I want to answer questions like “Do backlinks really matter in getting to Page 1 of SERPs?” and “What kind of Page Authority score do I really need to be in the top 10 results?” To do this, I will need to combine data from a number of Google searches with data on each result that has the SEO attributes in that I want to measure.
Let’s get started and work through how to combine the following tasks to achieve this, which can all be setup for free:
Querying with Google Custom Search Engine
Using the free Moz API account
Harvesting data with PHP and MySQL
Analyzing data with SQL and R
Querying with Google Custom Search Engine
We first need to query Google and get some results stored. To stay on the right side of Google’s terms of service, we’ll not be scraping Google.com directly but will instead use Google’s Custom Search feature. Google’s Custom Search is designed mainly to let website owners provide a Google like search widget on their website. However, there is also a REST based Google Search API that is free and lets you query Google and retrieve results in the popular JSON format. There are quota limits but these can be configured and extended to provide a good sample of data to work with.
When configured correctly to search the entire web, you can send queries to your Custom Search Engine, in our case using PHP, and treat them like Google responses, albeit with some caveats. The main limitations of using a Custom Search Engine are: (i) it doesn’t use some Google Web Search features such as personalized results and; (ii) it may have a subset of results from the Google index if you include more than ten sites.
Notwithstanding these limitations, there are many search options that can be passed to the Custom Search Engine to proxy what you might expect Google.com to return. In our scenario, we passed the following when making a call:
https://www.googleapis.com/customsearch/v1?key=<google_api_id>&userIp= <ip_address>&cx<custom_search_engine_id>&q=iPhone+X&cr=countryUS&start= 1</custom_search_engine_id></ip_address></google_api_id>
Where:
https://www.googleapis.com/customsearch/v1 – is the URL for the Google Custom Search API
key=<GOOGLE_API_ID> – Your Google Developer API Key
userIp=<IP_ADDRESS> – The IP address of the local machine making the call
cx=<CUSTOM_SEARCH_ENGINE_ID> – Your Google Custom Search Engine ID
q=iPhone+X – The Google query string (‘+’ replaces ‘ ‘)
cr=countryUS – Country restriction (from Goolge’s Country Collection Name list)
start=1 – The index of the first result to return – e.g. SERP page 1. Successive calls would increment this to get pages 2–5.
Google has said that the Google Custom Search engine differs from Google .com, but in my limited prod testing comparing results between the two, I was encouraged by the similarities and so continued with the analysis. That said, keep in mind that the data and results below come from Google Custom Search (using ‘whole web’ queries), not Google.com.
Using the free Moz API account
Moz provide an Application Programming Interface (API). To use it you will need to register for a Mozscape API key, which is free but limited to 2,500 rows per month and one query every ten seconds. Current paid plans give you increased quotas and start at $250/month. Having a free account and API key, you can then query the Links API and analyze the following metrics:
Moz data field
Moz API code
Description
ueid
32
The number of external equity links to the URL
uid
2048
The number of links (external, equity or nonequity or not,) to the URL
umrp**
16384
The MozRank of the URL, as a normalized 10-point score
umrr**
16384
The MozRank of the URL, as a raw score
fmrp**
32768
The MozRank of the URL's subdomain, as a normalized 10-point score
fmrr**
32768
The MozRank of the URL's subdomain, as a raw score
us
536870912
The HTTP status code recorded for this URL, if available
upa
34359738368
A normalized 100-point score representing the likelihood of a page to rank well in search engine results
pda
68719476736
A normalized 100-point score representing the likelihood of a domain to rank well in search engine results
NOTE: Since this analysis was captured, Moz documented that they have deprecated these fields. However, in testing this (15-06-2019), the fields were still present.
Moz API Codes are added together before calling the Links API with something that looks like the following:
www.apple.com%2F?Cols=103616137253&AccessID=MOZ_ACCESS_ID& Expires=1560586149&Signature=<MOZ_SECRET_KEY>
Where:
https://ift.tt/1bbWaai" class="redactor-autoparser-object">https://ift.tt/2oVcks4... – Is the URL for the Moz API
http%3A%2F%2Fwww.apple.com%2F – An encoded URL that we want to get data on
Cols=103616137253 – The sum of the Moz API codes from the table above
AccessID=MOZ_ACCESS_ID – An encoded version of the Moz Access ID (found in your API account)
Expires=1560586149 – A timeout for the query - set a few minutes into the future
Signature=<MOZ_SECRET_KEY> – An encoded version of the Moz Access ID (found in your API account)
Moz will return with something like the following JSON:
Array ( [ut] => Apple [uu] => <a href="http://www.apple.com/" class="redactor-autoparser-object">www.apple.com/</a> [ueid] => 13078035 [uid] => 14632963 [uu] => www.apple.com/ [ueid] => 13078035 [uid] => 14632963 [umrp] => 9 [umrr] => 0.8999999762 [fmrp] => 2.602215052 [fmrr] => 0.2602215111 [us] => 200 [upa] => 90 [pda] => 100 )
For a great starting point on querying Moz with PHP, Perl, Python, Ruby and Javascript, see this repository on Github. I chose to use PHP.
Harvesting data with PHP and MySQL
Now we have a Google Custom Search Engine and our Moz API, we’re almost ready to capture data. Google and Moz respond to requests via the JSON format and so can be queried by many popular programming languages. In addition to my chosen language, PHP, I wrote the results of both Google and Moz to a database and chose MySQL Community Edition for this. Other databases could be also used, e.g. Postgres, Oracle, Microsoft SQL Server etc. Doing so enables persistence of the data and ad-hoc analysis using SQL (Structured Query Language) as well as other languages (like R, which I will go over later). After creating database tables to hold the Google search results (with fields for rank, URL etc.) and a table to hold Moz data fields (ueid, upa, uda etc.), we’re ready to design our data harvesting plan.
Google provide a generous quota with the Custom Search Engine (up to 100M queries per day with the same Google developer console key) but the Moz free API is limited to 2,500. Though for Moz, paid for options provide between 120k and 40M rows per month depending on plans and range in cost from $250–$10,000/month. Therefore, as I’m just exploring the free option, I designed my code to harvest 125 Google queries over 2 pages of SERPs (10 results per page) allowing me to stay within the Moz 2,500 row quota. As for which searches to fire at Google, there are numerous resources to use from. I chose to use Mondovo as they provide numerous lists by category and up to 500 words per list which is ample for the experiment.
I also rolled in a few PHP helper classes alongside my own code for database I/O and HTTP.
In summary, the main PHP building blocks and sources used were:
Google Custom Search Engine – Ash Kiswany wrote an excellent article using Jacob Fogg’s PHP interface for Google Custom Search;
Mozscape API – As mentioned, this PHP implementation for accessing Moz on Github was a good starting point;
Website crawler and HTTP – At Purple Toolz, we have our own crawler called PurpleToolzBot which uses Curl for HTTP and this Simple HTML DOM Parser;
Database I/O – PHP has excellent support for MySQL which I wrapped into classes from these tutorials.
One factor to be aware of is the 10 second interval between Moz API calls. This is to prevent Moz being overloaded by free API users. To handle this in software, I wrote a "query throttler" which blocked access to the Moz API between successive calls within a timeframe. However, whilst working perfectly it meant that calling Moz 2,500 times in succession took just under 7 hours to complete.
Analyzing data with SQL and R
Data harvested. Now the fun begins!
It’s time to have a look at what we’ve got. This is sometimes called data wrangling. I use a free statistical programming language called R along with a development environment (editor) called R Studio. There are other languages such as Stata and more graphical data science tools like Tableau, but these cost and the finance director at Purple Toolz isn’t someone to cross!
I have been using R for a number of years because it’s open source and it has many third-party libraries, making it extremely versatile and appropriate for this kind of work.
Let’s roll up our sleeves.
I now have a couple of database tables with the results of my 125 search term queries across 2 pages of SERPS (i.e. 20 ranked URLs per search term). Two database tables hold the Google results and another table holds the Moz data results. To access these, we’ll need to do a database INNER JOIN which we can easily accomplish by using the RMySQL package with R. This is loaded by typing "install.packages('RMySQL')" into R’s console and including the line "library(RMySQL)" at the top of our R script.
We can then do the following to connect and get the data into an R data frame variable called "theResults."
library(RMySQL) # INNER JOIN the two tables theQuery <- " SELECT A.*, B.*, C.* FROM ( SELECT cseq_search_id FROM cse_query ) A -- Custom Search Query INNER JOIN ( SELECT cser_cseq_id, cser_rank, cser_url FROM cse_results ) B -- Custom Search Results ON A.cseq_search_id = B.cser_cseq_id INNER JOIN ( SELECT * FROM moz ) C -- Moz Data Fields ON B.cser_url = C.moz_url ; " # [1] Connect to the database # Replace USER_NAME with your database username # Replace PASSWORD with your database password # Replace MY_DB with your database name theConn <- dbConnect(dbDriver("MySQL"), user = "USER_NAME", password = "PASSWORD", dbname = "MY_DB") # [2] Query the database and hold the results theResults <- dbGetQuery(theConn, theQuery) # [3] Disconnect from the database dbDisconnect(theConn)
NOTE: I have two tables to hold the Google Custom Search Engine data. One holds data on the Google query (cse_query) and one holds results (cse_results).
We can now use R’s full range of statistical functions to begin wrangling.
Let’s start with some summaries to get a feel for the data. The process I go through is basically the same for each of the fields, so let’s illustrate and use Moz’s ‘UEID’ field (the number of external equity links to a URL). By typing the following into R I get the this:
> summary(theResults$moz_ueid) Min. 1st Qu. Median Mean 3rd Qu. Max. 0 1 20 14709 182 2755274 > quantile(theResults$moz_ueid, probs = c(1, 5, 10, 25, 50, 75, 80, 90, 95, 99, 100)/100) 1% 5% 10% 25% 50% 75% 80% 90% 95% 99% 100% 0.0 0.0 0.0 1.0 20.0 182.0 337.2 1715.2 7873.4 412283.4 2755274.0
Looking at this, you can see that the data is skewed (a lot) by the relationship of the median to the mean, which is being pulled by values in the upper quartile range (values beyond 75% of the observations). We can however, plot this as a box and whisker plot in R where each X value is the distribution of UEIDs by rank from Google Custom Search position 1-20.
Note we are using a log scale on the y-axis so that we can display the full range of values as they vary a lot!
A box and whisker plot in R of Moz’s UEID by Google rank (note: log scale)
Box and whisker plots are great as they show a lot of information in them (see the geom_boxplot function in R). The purple boxed area represents the Inter-Quartile Range (IQR) which are the values between 25% and 75% of observations. The horizontal line in each ‘box’ represents the median value (the one in the middle when ordered), whilst the lines extending from the box (called the ‘whiskers’) represent 1.5x IQR. Dots outside the whiskers are called ‘outliers’ and show where the extents of each rank’s set of observations are. Despite the log scale, we can see a noticeable pull-up from rank #10 to rank #1 in median values, indicating that the number of equity links might be a Google ranking factor. Let’s explore this further with density plots.
Density plots are a lot like distributions (histograms) but show smooth lines rather than bars for the data. Much like a histogram, a density plot’s peak shows where the data values are concentrated and can help when comparing two distributions. In the density plot below, I have split the data into two categories: (i) results that appeared on Page 1 of SERPs ranked 1-10 are in pink and; (ii) results that appeared on SERP Page 2 are in blue. I have also plotted the medians of both distributions to help illustrate the difference in results between Page 1 and Page 2.
The inference from these two density plots is that Page 1 SERP results had more external equity backlinks (UEIDs) on than Page 2 results. You can also see the median values for these two categories below which clearly shows how the value for Page 1 (38) is far greater than Page 2 (11). So we now have some numbers to base our SEO strategy for backlinks on.
# Create a factor in R according to which SERP page a result (cser_rank) is on > theResults$rankBin <- paste("Page", ceiling(theResults$cser_rank / 10)) > theResults$rankBin <- factor(theResults$rankBin) # Now report the medians by SERP page by calling ‘tapply’ > tapply(theResults$moz_ueid, theResults$rankBin, median) Page 1 Page 2 38 11
From this, we can deduce that equity backlinks (UEID) matter and if I were advising a client based on this data, I would say they should be looking to get over 38 equity-based backlinks to help them get to Page 1 of SERPs. Of course, this is a limited sample and more research, a bigger sample and other ranking factors would need to be considered, but you get the idea.
Now let’s investigate another metric that has less of a range on it than UEID and look at Moz’s UPA measure, which is the likelihood that a page will rank well in search engine results.
> summary(theResults$moz_upa) Min. 1st Qu. Median Mean 3rd Qu. Max. 1.00 33.00 41.00 41.22 50.00 81.00 > quantile(theResults$moz_upa, probs = c(1, 5, 10, 25, 50, 75, 80, 90, 95, 99, 100)/100) 1% 5% 10% 25% 50% 75% 80% 90% 95% 99% 100% 12 20 25 33 41 50 53 58 62 75 81
UPA is a number given to a URL and ranges between 0–100. The data is better behaved than the previous UEID unbounded variable having its mean and median close together making for a more ‘normal’ distribution as we can see below by plotting a histogram in R.
A histogram of Moz’s UPA score
We’ll do the same Page 1 : Page 2 split and density plot that we did before and look at the UPA score distributions when we divide the UPA data into two groups.
# Report the medians by SERP page by calling ‘tapply’ > tapply(theResults$moz_upa, theResults$rankBin, median) Page 1 Page 2 43 39
In summary, two very different distributions from two Moz API variables. But both showed differences in their scores between SERP pages and provide you with tangible values (medians) to work with and ultimately advise clients on or apply to your own SEO.
Of course, this is just a small sample and shouldn’t be taken literally. But with free resources from both Google and Moz, you can now see how you can begin to develop analytical capabilities of your own to base your assumptions on rather than accepting the norm. SEO ranking factors change all the time and having your own analytical tools to conduct your own tests and experiments on will help give you credibility and perhaps even a unique insight on something hitherto unknown.
Google provide you with a healthy free quota to obtain search results from. If you need more than the 2,500 rows/month Moz provide for free there are numerous paid-for plans you can purchase. MySQL is a free download and R is also a free package for statistical analysis (and much more).
Go explore!
Sign up for The Moz Top 10, a semimonthly mailer updating you on the top ten hottest pieces of SEO news, tips, and rad links uncovered by the Moz team. Think of it as your exclusive digest of stuff you don't have time to hunt down but want to read!
via Blogger https://ift.tt/31JQAg8 #blogger #bloggingtips #bloggerlife #bloggersgetsocial #ontheblog #writersofinstagram #writingprompt #instapoetry #writerscommunity #writersofig #writersblock #writerlife #writtenword #instawriters #spilledink #wordgasm #creativewriting #poetsofinstagram #blackoutpoetry #poetsofig
0 notes
Text
SEO Analytics for Free - Combining Google Search with the Moz API
Posted by Purple-Toolz
I’m a self-funded start-up business owner. As such, I want to get as much as I can for free before convincing our finance director to spend our hard-earned bootstrapping funds. I’m also an analyst with a background in data and computer science, so a bit of a geek by any definition.
What I try to do, with my SEO analyst hat on, is hunt down great sources of free data and wrangle it into something insightful. Why? Because there’s no value in basing client advice on conjecture. It’s far better to combine quality data with good analysis and help our clients better understand what’s important for them to focus on.
In this article, I will tell you how to get started using a few free resources and illustrate how to pull together unique analytics that provide useful insights for your blog articles if you’re a writer, your agency if you’re an SEO, or your website if you’re a client or owner doing SEO yourself.
The scenario I’m going to use is that I want analyze some SEO attributes (e.g. backlinks, Page Authority etc.) and look at their effect on Google ranking. I want to answer questions like “Do backlinks really matter in getting to Page 1 of SERPs?” and “What kind of Page Authority score do I really need to be in the top 10 results?” To do this, I will need to combine data from a number of Google searches with data on each result that has the SEO attributes in that I want to measure.
Let’s get started and work through how to combine the following tasks to achieve this, which can all be setup for free:
Querying with Google Custom Search Engine
Using the free Moz API account
Harvesting data with PHP and MySQL
Analyzing data with SQL and R
Querying with Google Custom Search Engine
We first need to query Google and get some results stored. To stay on the right side of Google’s terms of service, we’ll not be scraping Google.com directly but will instead use Google’s Custom Search feature. Google’s Custom Search is designed mainly to let website owners provide a Google like search widget on their website. However, there is also a REST based Google Search API that is free and lets you query Google and retrieve results in the popular JSON format. There are quota limits but these can be configured and extended to provide a good sample of data to work with.
When configured correctly to search the entire web, you can send queries to your Custom Search Engine, in our case using PHP, and treat them like Google responses, albeit with some caveats. The main limitations of using a Custom Search Engine are: (i) it doesn’t use some Google Web Search features such as personalized results and; (ii) it may have a subset of results from the Google index if you include more than ten sites.
Notwithstanding these limitations, there are many search options that can be passed to the Custom Search Engine to proxy what you might expect Google.com to return. In our scenario, we passed the following when making a call:
https://www.googleapis.com/customsearch/v1?key=<google_api_id>&userIp= <ip_address>&cx<custom_search_engine_id>&q=iPhone+X&cr=countryUS&start= 1</custom_search_engine_id></ip_address></google_api_id>
Where:
https://www.googleapis.com/customsearch/v1 – is the URL for the Google Custom Search API
key=<GOOGLE_API_ID> – Your Google Developer API Key
userIp=<IP_ADDRESS> – The IP address of the local machine making the call
cx=<CUSTOM_SEARCH_ENGINE_ID> – Your Google Custom Search Engine ID
q=iPhone+X – The Google query string (‘+’ replaces ‘ ‘)
cr=countryUS – Country restriction (from Goolge’s Country Collection Name list)
start=1 – The index of the first result to return – e.g. SERP page 1. Successive calls would increment this to get pages 2–5.
Google has said that the Google Custom Search engine differs from Google .com, but in my limited prod testing comparing results between the two, I was encouraged by the similarities and so continued with the analysis. That said, keep in mind that the data and results below come from Google Custom Search (using ‘whole web’ queries), not Google.com.
Using the free Moz API account
Moz provide an Application Programming Interface (API). To use it you will need to register for a Mozscape API key, which is free but limited to 2,500 rows per month and one query every ten seconds. Current paid plans give you increased quotas and start at $250/month. Having a free account and API key, you can then query the Links API and analyze the following metrics:
Moz data field
Moz API code
Description
ueid
32
The number of external equity links to the URL
uid
2048
The number of links (external, equity or nonequity or not,) to the URL
umrp**
16384
The MozRank of the URL, as a normalized 10-point score
umrr**
16384
The MozRank of the URL, as a raw score
fmrp**
32768
The MozRank of the URL's subdomain, as a normalized 10-point score
fmrr**
32768
The MozRank of the URL's subdomain, as a raw score
us
536870912
The HTTP status code recorded for this URL, if available
upa
34359738368
A normalized 100-point score representing the likelihood of a page to rank well in search engine results
pda
68719476736
A normalized 100-point score representing the likelihood of a domain to rank well in search engine results
NOTE: Since this analysis was captured, Moz documented that they have deprecated these fields. However, in testing this (15-06-2019), the fields were still present.
Moz API Codes are added together before calling the Links API with something that looks like the following:
www.apple.com%2F?Cols=103616137253&AccessID=MOZ_ACCESS_ID& Expires=1560586149&Signature=<MOZ_SECRET_KEY>
Where:
http://lsapi.seomoz.com/linkscape/url-metrics/" class="redactor-autoparser-object">http://lsapi.seomoz.com/linksc... – Is the URL for the Moz API
http%3A%2F%2Fwww.apple.com%2F – An encoded URL that we want to get data on
Cols=103616137253 – The sum of the Moz API codes from the table above
AccessID=MOZ_ACCESS_ID – An encoded version of the Moz Access ID (found in your API account)
Expires=1560586149 – A timeout for the query - set a few minutes into the future
Signature=<MOZ_SECRET_KEY> – An encoded version of the Moz Access ID (found in your API account)
Moz will return with something like the following JSON:
Array ( [ut] => Apple [uu] => <a href="http://www.apple.com/" class="redactor-autoparser-object">www.apple.com/</a> [ueid] => 13078035 [uid] => 14632963 [uu] => www.apple.com/ [ueid] => 13078035 [uid] => 14632963 [umrp] => 9 [umrr] => 0.8999999762 [fmrp] => 2.602215052 [fmrr] => 0.2602215111 [us] => 200 [upa] => 90 [pda] => 100 )
For a great starting point on querying Moz with PHP, Perl, Python, Ruby and Javascript, see this repository on Github. I chose to use PHP.
Harvesting data with PHP and MySQL
Now we have a Google Custom Search Engine and our Moz API, we’re almost ready to capture data. Google and Moz respond to requests via the JSON format and so can be queried by many popular programming languages. In addition to my chosen language, PHP, I wrote the results of both Google and Moz to a database and chose MySQL Community Edition for this. Other databases could be also used, e.g. Postgres, Oracle, Microsoft SQL Server etc. Doing so enables persistence of the data and ad-hoc analysis using SQL (Structured Query Language) as well as other languages (like R, which I will go over later). After creating database tables to hold the Google search results (with fields for rank, URL etc.) and a table to hold Moz data fields (ueid, upa, uda etc.), we’re ready to design our data harvesting plan.
Google provide a generous quota with the Custom Search Engine (up to 100M queries per day with the same Google developer console key) but the Moz free API is limited to 2,500. Though for Moz, paid for options provide between 120k and 40M rows per month depending on plans and range in cost from $250–$10,000/month. Therefore, as I’m just exploring the free option, I designed my code to harvest 125 Google queries over 2 pages of SERPs (10 results per page) allowing me to stay within the Moz 2,500 row quota. As for which searches to fire at Google, there are numerous resources to use from. I chose to use Mondovo as they provide numerous lists by category and up to 500 words per list which is ample for the experiment.
I also rolled in a few PHP helper classes alongside my own code for database I/O and HTTP.
In summary, the main PHP building blocks and sources used were:
Google Custom Search Engine – Ash Kiswany wrote an excellent article using Jacob Fogg’s PHP interface for Google Custom Search;
Mozscape API – As mentioned, this PHP implementation for accessing Moz on Github was a good starting point;
Website crawler and HTTP – At Purple Toolz, we have our own crawler called PurpleToolzBot which uses Curl for HTTP and this Simple HTML DOM Parser;
Database I/O – PHP has excellent support for MySQL which I wrapped into classes from these tutorials.
One factor to be aware of is the 10 second interval between Moz API calls. This is to prevent Moz being overloaded by free API users. To handle this in software, I wrote a "query throttler" which blocked access to the Moz API between successive calls within a timeframe. However, whilst working perfectly it meant that calling Moz 2,500 times in succession took just under 7 hours to complete.
Analyzing data with SQL and R
Data harvested. Now the fun begins!
It’s time to have a look at what we’ve got. This is sometimes called data wrangling. I use a free statistical programming language called R along with a development environment (editor) called R Studio. There are other languages such as Stata and more graphical data science tools like Tableau, but these cost and the finance director at Purple Toolz isn’t someone to cross!
I have been using R for a number of years because it’s open source and it has many third-party libraries, making it extremely versatile and appropriate for this kind of work.
Let’s roll up our sleeves.
I now have a couple of database tables with the results of my 125 search term queries across 2 pages of SERPS (i.e. 20 ranked URLs per search term). Two database tables hold the Google results and another table holds the Moz data results. To access these, we’ll need to do a database INNER JOIN which we can easily accomplish by using the RMySQL package with R. This is loaded by typing "install.packages('RMySQL')" into R’s console and including the line "library(RMySQL)" at the top of our R script.
We can then do the following to connect and get the data into an R data frame variable called "theResults."
library(RMySQL) # INNER JOIN the two tables theQuery <- " SELECT A.*, B.*, C.* FROM ( SELECT cseq_search_id FROM cse_query ) A -- Custom Search Query INNER JOIN ( SELECT cser_cseq_id, cser_rank, cser_url FROM cse_results ) B -- Custom Search Results ON A.cseq_search_id = B.cser_cseq_id INNER JOIN ( SELECT * FROM moz ) C -- Moz Data Fields ON B.cser_url = C.moz_url ; " # [1] Connect to the database # Replace USER_NAME with your database username # Replace PASSWORD with your database password # Replace MY_DB with your database name theConn <- dbConnect(dbDriver("MySQL"), user = "USER_NAME", password = "PASSWORD", dbname = "MY_DB") # [2] Query the database and hold the results theResults <- dbGetQuery(theConn, theQuery) # [3] Disconnect from the database dbDisconnect(theConn)
NOTE: I have two tables to hold the Google Custom Search Engine data. One holds data on the Google query (cse_query) and one holds results (cse_results).
We can now use R’s full range of statistical functions to begin wrangling.
Let’s start with some summaries to get a feel for the data. The process I go through is basically the same for each of the fields, so let’s illustrate and use Moz’s ‘UEID’ field (the number of external equity links to a URL). By typing the following into R I get the this:
> summary(theResults$moz_ueid) Min. 1st Qu. Median Mean 3rd Qu. Max. 0 1 20 14709 182 2755274 > quantile(theResults$moz_ueid, probs = c(1, 5, 10, 25, 50, 75, 80, 90, 95, 99, 100)/100) 1% 5% 10% 25% 50% 75% 80% 90% 95% 99% 100% 0.0 0.0 0.0 1.0 20.0 182.0 337.2 1715.2 7873.4 412283.4 2755274.0
Looking at this, you can see that the data is skewed (a lot) by the relationship of the median to the mean, which is being pulled by values in the upper quartile range (values beyond 75% of the observations). We can however, plot this as a box and whisker plot in R where each X value is the distribution of UEIDs by rank from Google Custom Search position 1-20.
Note we are using a log scale on the y-axis so that we can display the full range of values as they vary a lot!
A box and whisker plot in R of Moz’s UEID by Google rank (note: log scale)
Box and whisker plots are great as they show a lot of information in them (see the geom_boxplot function in R). The purple boxed area represents the Inter-Quartile Range (IQR) which are the values between 25% and 75% of observations. The horizontal line in each ‘box’ represents the median value (the one in the middle when ordered), whilst the lines extending from the box (called the ‘whiskers’) represent 1.5x IQR. Dots outside the whiskers are called ‘outliers’ and show where the extents of each rank’s set of observations are. Despite the log scale, we can see a noticeable pull-up from rank #10 to rank #1 in median values, indicating that the number of equity links might be a Google ranking factor. Let’s explore this further with density plots.
Density plots are a lot like distributions (histograms) but show smooth lines rather than bars for the data. Much like a histogram, a density plot’s peak shows where the data values are concentrated and can help when comparing two distributions. In the density plot below, I have split the data into two categories: (i) results that appeared on Page 1 of SERPs ranked 1-10 are in pink and; (ii) results that appeared on SERP Page 2 are in blue. I have also plotted the medians of both distributions to help illustrate the difference in results between Page 1 and Page 2.
The inference from these two density plots is that Page 1 SERP results had more external equity backlinks (UEIDs) on than Page 2 results. You can also see the median values for these two categories below which clearly shows how the value for Page 1 (38) is far greater than Page 2 (11). So we now have some numbers to base our SEO strategy for backlinks on.
# Create a factor in R according to which SERP page a result (cser_rank) is on > theResults$rankBin <- paste("Page", ceiling(theResults$cser_rank / 10)) > theResults$rankBin <- factor(theResults$rankBin) # Now report the medians by SERP page by calling ‘tapply’ > tapply(theResults$moz_ueid, theResults$rankBin, median) Page 1 Page 2 38 11
From this, we can deduce that equity backlinks (UEID) matter and if I were advising a client based on this data, I would say they should be looking to get over 38 equity-based backlinks to help them get to Page 1 of SERPs. Of course, this is a limited sample and more research, a bigger sample and other ranking factors would need to be considered, but you get the idea.
Now let’s investigate another metric that has less of a range on it than UEID and look at Moz’s UPA measure, which is the likelihood that a page will rank well in search engine results.
> summary(theResults$moz_upa) Min. 1st Qu. Median Mean 3rd Qu. Max. 1.00 33.00 41.00 41.22 50.00 81.00 > quantile(theResults$moz_upa, probs = c(1, 5, 10, 25, 50, 75, 80, 90, 95, 99, 100)/100) 1% 5% 10% 25% 50% 75% 80% 90% 95% 99% 100% 12 20 25 33 41 50 53 58 62 75 81
UPA is a number given to a URL and ranges between 0–100. The data is better behaved than the previous UEID unbounded variable having its mean and median close together making for a more ‘normal’ distribution as we can see below by plotting a histogram in R.
A histogram of Moz’s UPA score
We’ll do the same Page 1 : Page 2 split and density plot that we did before and look at the UPA score distributions when we divide the UPA data into two groups.
# Report the medians by SERP page by calling ‘tapply’ > tapply(theResults$moz_upa, theResults$rankBin, median) Page 1 Page 2 43 39
In summary, two very different distributions from two Moz API variables. But both showed differences in their scores between SERP pages and provide you with tangible values (medians) to work with and ultimately advise clients on or apply to your own SEO.
Of course, this is just a small sample and shouldn’t be taken literally. But with free resources from both Google and Moz, you can now see how you can begin to develop analytical capabilities of your own to base your assumptions on rather than accepting the norm. SEO ranking factors change all the time and having your own analytical tools to conduct your own tests and experiments on will help give you credibility and perhaps even a unique insight on something hitherto unknown.
Google provide you with a healthy free quota to obtain search results from. If you need more than the 2,500 rows/month Moz provide for free there are numerous paid-for plans you can purchase. MySQL is a free download and R is also a free package for statistical analysis (and much more).
Go explore!
Sign up for The Moz Top 10, a semimonthly mailer updating you on the top ten hottest pieces of SEO news, tips, and rad links uncovered by the Moz team. Think of it as your exclusive digest of stuff you don't have time to hunt down but want to read!
from The Moz Blog http://tracking.feedpress.it/link/9375/12915591
0 notes
Text
SEO Analytics for Free - Combining Google Search with the Moz API
Posted by Purple-Toolz
I’m a self-funded start-up business owner. As such, I want to get as much as I can for free before convincing our finance director to spend our hard-earned bootstrapping funds. I’m also an analyst with a background in data and computer science, so a bit of a geek by any definition.
What I try to do, with my SEO analyst hat on, is hunt down great sources of free data and wrangle it into something insightful. Why? Because there’s no value in basing client advice on conjecture. It’s far better to combine quality data with good analysis and help our clients better understand what’s important for them to focus on.
In this article, I will tell you how to get started using a few free resources and illustrate how to pull together unique analytics that provide useful insights for your blog articles if you’re a writer, your agency if you’re an SEO, or your website if you’re a client or owner doing SEO yourself.
The scenario I’m going to use is that I want analyze some SEO attributes (e.g. backlinks, Page Authority etc.) and look at their effect on Google ranking. I want to answer questions like “Do backlinks really matter in getting to Page 1 of SERPs?” and “What kind of Page Authority score do I really need to be in the top 10 results?” To do this, I will need to combine data from a number of Google searches with data on each result that has the SEO attributes in that I want to measure.
Let’s get started and work through how to combine the following tasks to achieve this, which can all be setup for free:
Querying with Google Custom Search Engine
Using the free Moz API account
Harvesting data with PHP and MySQL
Analyzing data with SQL and R
Querying with Google Custom Search Engine
We first need to query Google and get some results stored. To stay on the right side of Google’s terms of service, we’ll not be scraping Google.com directly but will instead use Google’s Custom Search feature. Google’s Custom Search is designed mainly to let website owners provide a Google like search widget on their website. However, there is also a REST based Google Search API that is free and lets you query Google and retrieve results in the popular JSON format. There are quota limits but these can be configured and extended to provide a good sample of data to work with.
When configured correctly to search the entire web, you can send queries to your Custom Search Engine, in our case using PHP, and treat them like Google responses, albeit with some caveats. The main limitations of using a Custom Search Engine are: (i) it doesn’t use some Google Web Search features such as personalized results and; (ii) it may have a subset of results from the Google index if you include more than ten sites.
Notwithstanding these limitations, there are many search options that can be passed to the Custom Search Engine to proxy what you might expect Google.com to return. In our scenario, we passed the following when making a call:
https://www.googleapis.com/customsearch/v1?key=<google_api_id>&userIp= <ip_address>&cx<custom_search_engine_id>&q=iPhone+X&cr=countryUS&start= 1</custom_search_engine_id></ip_address></google_api_id>
Where:
https://www.googleapis.com/customsearch/v1 – is the URL for the Google Custom Search API
key=<GOOGLE_API_ID> – Your Google Developer API Key
userIp=<IP_ADDRESS> – The IP address of the local machine making the call
cx=<CUSTOM_SEARCH_ENGINE_ID> – Your Google Custom Search Engine ID
q=iPhone+X – The Google query string (‘+’ replaces ‘ ‘)
cr=countryUS – Country restriction (from Goolge’s Country Collection Name list)
start=1 – The index of the first result to return – e.g. SERP page 1. Successive calls would increment this to get pages 2–5.
Google has said that the Google Custom Search engine differs from Google .com, but in my limited prod testing comparing results between the two, I was encouraged by the similarities and so continued with the analysis. That said, keep in mind that the data and results below come from Google Custom Search (using ‘whole web’ queries), not Google.com.
Using the free Moz API account
Moz provide an Application Programming Interface (API). To use it you will need to register for a Mozscape API key, which is free but limited to 2,500 rows per month and one query every ten seconds. Current paid plans give you increased quotas and start at $250/month. Having a free account and API key, you can then query the Links API and analyze the following metrics:
Moz data field
Moz API code
Description
ueid
32
The number of external equity links to the URL
uid
2048
The number of links (external, equity or nonequity or not,) to the URL
umrp**
16384
The MozRank of the URL, as a normalized 10-point score
umrr**
16384
The MozRank of the URL, as a raw score
fmrp**
32768
The MozRank of the URL's subdomain, as a normalized 10-point score
fmrr**
32768
The MozRank of the URL's subdomain, as a raw score
us
536870912
The HTTP status code recorded for this URL, if available
upa
34359738368
A normalized 100-point score representing the likelihood of a page to rank well in search engine results
pda
68719476736
A normalized 100-point score representing the likelihood of a domain to rank well in search engine results
NOTE: Since this analysis was captured, Moz documented that they have deprecated these fields. However, in testing this (15-06-2019), the fields were still present.
Moz API Codes are added together before calling the Links API with something that looks like the following:
www.apple.com%2F?Cols=103616137253&AccessID=MOZ_ACCESS_ID& Expires=1560586149&Signature=<MOZ_SECRET_KEY>
Where:
https://ift.tt/1bbWaai" class="redactor-autoparser-object">https://ift.tt/2oVcks4... – Is the URL for the Moz API
http%3A%2F%2Fwww.apple.com%2F – An encoded URL that we want to get data on
Cols=103616137253 – The sum of the Moz API codes from the table above
AccessID=MOZ_ACCESS_ID – An encoded version of the Moz Access ID (found in your API account)
Expires=1560586149 – A timeout for the query - set a few minutes into the future
Signature=<MOZ_SECRET_KEY> – An encoded version of the Moz Access ID (found in your API account)
Moz will return with something like the following JSON:
Array ( [ut] => Apple [uu] => <a href="http://www.apple.com/" class="redactor-autoparser-object">www.apple.com/</a> [ueid] => 13078035 [uid] => 14632963 [uu] => www.apple.com/ [ueid] => 13078035 [uid] => 14632963 [umrp] => 9 [umrr] => 0.8999999762 [fmrp] => 2.602215052 [fmrr] => 0.2602215111 [us] => 200 [upa] => 90 [pda] => 100 )
For a great starting point on querying Moz with PHP, Perl, Python, Ruby and Javascript, see this repository on Github. I chose to use PHP.
Harvesting data with PHP and MySQL
Now we have a Google Custom Search Engine and our Moz API, we’re almost ready to capture data. Google and Moz respond to requests via the JSON format and so can be queried by many popular programming languages. In addition to my chosen language, PHP, I wrote the results of both Google and Moz to a database and chose MySQL Community Edition for this. Other databases could be also used, e.g. Postgres, Oracle, Microsoft SQL Server etc. Doing so enables persistence of the data and ad-hoc analysis using SQL (Structured Query Language) as well as other languages (like R, which I will go over later). After creating database tables to hold the Google search results (with fields for rank, URL etc.) and a table to hold Moz data fields (ueid, upa, uda etc.), we’re ready to design our data harvesting plan.
Google provide a generous quota with the Custom Search Engine (up to 100M queries per day with the same Google developer console key) but the Moz free API is limited to 2,500. Though for Moz, paid for options provide between 120k and 40M rows per month depending on plans and range in cost from $250–$10,000/month. Therefore, as I’m just exploring the free option, I designed my code to harvest 125 Google queries over 2 pages of SERPs (10 results per page) allowing me to stay within the Moz 2,500 row quota. As for which searches to fire at Google, there are numerous resources to use from. I chose to use Mondovo as they provide numerous lists by category and up to 500 words per list which is ample for the experiment.
I also rolled in a few PHP helper classes alongside my own code for database I/O and HTTP.
In summary, the main PHP building blocks and sources used were:
Google Custom Search Engine – Ash Kiswany wrote an excellent article using Jacob Fogg’s PHP interface for Google Custom Search;
Mozscape API – As mentioned, this PHP implementation for accessing Moz on Github was a good starting point;
Website crawler and HTTP – At Purple Toolz, we have our own crawler called PurpleToolzBot which uses Curl for HTTP and this Simple HTML DOM Parser;
Database I/O – PHP has excellent support for MySQL which I wrapped into classes from these tutorials.
One factor to be aware of is the 10 second interval between Moz API calls. This is to prevent Moz being overloaded by free API users. To handle this in software, I wrote a "query throttler" which blocked access to the Moz API between successive calls within a timeframe. However, whilst working perfectly it meant that calling Moz 2,500 times in succession took just under 7 hours to complete.
Analyzing data with SQL and R
Data harvested. Now the fun begins!
It’s time to have a look at what we’ve got. This is sometimes called data wrangling. I use a free statistical programming language called R along with a development environment (editor) called R Studio. There are other languages such as Stata and more graphical data science tools like Tableau, but these cost and the finance director at Purple Toolz isn’t someone to cross!
I have been using R for a number of years because it’s open source and it has many third-party libraries, making it extremely versatile and appropriate for this kind of work.
Let’s roll up our sleeves.
I now have a couple of database tables with the results of my 125 search term queries across 2 pages of SERPS (i.e. 20 ranked URLs per search term). Two database tables hold the Google results and another table holds the Moz data results. To access these, we’ll need to do a database INNER JOIN which we can easily accomplish by using the RMySQL package with R. This is loaded by typing "install.packages('RMySQL')" into R’s console and including the line "library(RMySQL)" at the top of our R script.
We can then do the following to connect and get the data into an R data frame variable called "theResults."
library(RMySQL) # INNER JOIN the two tables theQuery <- " SELECT A.*, B.*, C.* FROM ( SELECT cseq_search_id FROM cse_query ) A -- Custom Search Query INNER JOIN ( SELECT cser_cseq_id, cser_rank, cser_url FROM cse_results ) B -- Custom Search Results ON A.cseq_search_id = B.cser_cseq_id INNER JOIN ( SELECT * FROM moz ) C -- Moz Data Fields ON B.cser_url = C.moz_url ; " # [1] Connect to the database # Replace USER_NAME with your database username # Replace PASSWORD with your database password # Replace MY_DB with your database name theConn <- dbConnect(dbDriver("MySQL"), user = "USER_NAME", password = "PASSWORD", dbname = "MY_DB") # [2] Query the database and hold the results theResults <- dbGetQuery(theConn, theQuery) # [3] Disconnect from the database dbDisconnect(theConn)
NOTE: I have two tables to hold the Google Custom Search Engine data. One holds data on the Google query (cse_query) and one holds results (cse_results).
We can now use R’s full range of statistical functions to begin wrangling.
Let’s start with some summaries to get a feel for the data. The process I go through is basically the same for each of the fields, so let’s illustrate and use Moz’s ‘UEID’ field (the number of external equity links to a URL). By typing the following into R I get the this:
> summary(theResults$moz_ueid) Min. 1st Qu. Median Mean 3rd Qu. Max. 0 1 20 14709 182 2755274 > quantile(theResults$moz_ueid, probs = c(1, 5, 10, 25, 50, 75, 80, 90, 95, 99, 100)/100) 1% 5% 10% 25% 50% 75% 80% 90% 95% 99% 100% 0.0 0.0 0.0 1.0 20.0 182.0 337.2 1715.2 7873.4 412283.4 2755274.0
Looking at this, you can see that the data is skewed (a lot) by the relationship of the median to the mean, which is being pulled by values in the upper quartile range (values beyond 75% of the observations). We can however, plot this as a box and whisker plot in R where each X value is the distribution of UEIDs by rank from Google Custom Search position 1-20.
Note we are using a log scale on the y-axis so that we can display the full range of values as they vary a lot!
A box and whisker plot in R of Moz’s UEID by Google rank (note: log scale)
Box and whisker plots are great as they show a lot of information in them (see the geom_boxplot function in R). The purple boxed area represents the Inter-Quartile Range (IQR) which are the values between 25% and 75% of observations. The horizontal line in each ‘box’ represents the median value (the one in the middle when ordered), whilst the lines extending from the box (called the ‘whiskers’) represent 1.5x IQR. Dots outside the whiskers are called ‘outliers’ and show where the extents of each rank’s set of observations are. Despite the log scale, we can see a noticeable pull-up from rank #10 to rank #1 in median values, indicating that the number of equity links might be a Google ranking factor. Let’s explore this further with density plots.
Density plots are a lot like distributions (histograms) but show smooth lines rather than bars for the data. Much like a histogram, a density plot’s peak shows where the data values are concentrated and can help when comparing two distributions. In the density plot below, I have split the data into two categories: (i) results that appeared on Page 1 of SERPs ranked 1-10 are in pink and; (ii) results that appeared on SERP Page 2 are in blue. I have also plotted the medians of both distributions to help illustrate the difference in results between Page 1 and Page 2.
The inference from these two density plots is that Page 1 SERP results had more external equity backlinks (UEIDs) on than Page 2 results. You can also see the median values for these two categories below which clearly shows how the value for Page 1 (38) is far greater than Page 2 (11). So we now have some numbers to base our SEO strategy for backlinks on.
# Create a factor in R according to which SERP page a result (cser_rank) is on > theResults$rankBin <- paste("Page", ceiling(theResults$cser_rank / 10)) > theResults$rankBin <- factor(theResults$rankBin) # Now report the medians by SERP page by calling ‘tapply’ > tapply(theResults$moz_ueid, theResults$rankBin, median) Page 1 Page 2 38 11
From this, we can deduce that equity backlinks (UEID) matter and if I were advising a client based on this data, I would say they should be looking to get over 38 equity-based backlinks to help them get to Page 1 of SERPs. Of course, this is a limited sample and more research, a bigger sample and other ranking factors would need to be considered, but you get the idea.
Now let’s investigate another metric that has less of a range on it than UEID and look at Moz’s UPA measure, which is the likelihood that a page will rank well in search engine results.
> summary(theResults$moz_upa) Min. 1st Qu. Median Mean 3rd Qu. Max. 1.00 33.00 41.00 41.22 50.00 81.00 > quantile(theResults$moz_upa, probs = c(1, 5, 10, 25, 50, 75, 80, 90, 95, 99, 100)/100) 1% 5% 10% 25% 50% 75% 80% 90% 95% 99% 100% 12 20 25 33 41 50 53 58 62 75 81
UPA is a number given to a URL and ranges between 0–100. The data is better behaved than the previous UEID unbounded variable having its mean and median close together making for a more ‘normal’ distribution as we can see below by plotting a histogram in R.
A histogram of Moz’s UPA score
We’ll do the same Page 1 : Page 2 split and density plot that we did before and look at the UPA score distributions when we divide the UPA data into two groups.
# Report the medians by SERP page by calling ‘tapply’ > tapply(theResults$moz_upa, theResults$rankBin, median) Page 1 Page 2 43 39
In summary, two very different distributions from two Moz API variables. But both showed differences in their scores between SERP pages and provide you with tangible values (medians) to work with and ultimately advise clients on or apply to your own SEO.
Of course, this is just a small sample and shouldn’t be taken literally. But with free resources from both Google and Moz, you can now see how you can begin to develop analytical capabilities of your own to base your assumptions on rather than accepting the norm. SEO ranking factors change all the time and having your own analytical tools to conduct your own tests and experiments on will help give you credibility and perhaps even a unique insight on something hitherto unknown.
Google provide you with a healthy free quota to obtain search results from. If you need more than the 2,500 rows/month Moz provide for free there are numerous paid-for plans you can purchase. MySQL is a free download and R is also a free package for statistical analysis (and much more).
Go explore!
Sign up for The Moz Top 10, a semimonthly mailer updating you on the top ten hottest pieces of SEO news, tips, and rad links uncovered by the Moz team. Think of it as your exclusive digest of stuff you don't have time to hunt down but want to read!
from The Moz Blog https://ift.tt/2JeY5Fh via IFTTT
0 notes
Text
How to dynamically add/remove table row in jquery with example
How to dynamically add/remove table row in jquery with example
In this article, we will learn about how we can add/remove table rows in jquery as well as we will remove the selected row. Believe me, removing table rows in jquery going to be so easy simple you will understand easily how to remove/add table rows in jquery. To achieve the add/remove table row in jquery, I have used two predefined jquery functions one is append() and the second is…

View On WordPress
#dynamically add/remove row in html table using jquery#dynamically add/remove rows in html table using jquery#dynamically add/remove rows in html table using php#dynamically addremove column in html table using jquery#dynamically addremove rows in html table using javascript#dynamically addremove rows in html table using jquery#dynamically addremove rows in html table using php#how to dynamically add/remove table rows using jquery#how to dynamically addremove table rows using jquery#how to sum values from table column and update when remove or add new row in jquery#javascript add rows to table dynamically#jquery add row to table after specific row#jquery add row to table dynamically jsfiddle#jquery delete table row by id
1 note
·
View note
Text
SEO Analytics for Free - Combining Google Search with the Moz API
Posted by Purple-Toolz
I’m a self-funded start-up business owner. As such, I want to get as much as I can for free before convincing our finance director to spend our hard-earned bootstrapping funds. I’m also an analyst with a background in data and computer science, so a bit of a geek by any definition.
What I try to do, with my SEO analyst hat on, is hunt down great sources of free data and wrangle it into something insightful. Why? Because there’s no value in basing client advice on conjecture. It’s far better to combine quality data with good analysis and help our clients better understand what’s important for them to focus on.
In this article, I will tell you how to get started using a few free resources and illustrate how to pull together unique analytics that provide useful insights for your blog articles if you’re a writer, your agency if you’re an SEO, or your website if you’re a client or owner doing SEO yourself.
The scenario I’m going to use is that I want analyze some SEO attributes (e.g. backlinks, Page Authority etc.) and look at their effect on Google ranking. I want to answer questions like “Do backlinks really matter in getting to Page 1 of SERPs?” and “What kind of Page Authority score do I really need to be in the top 10 results?” To do this, I will need to combine data from a number of Google searches with data on each result that has the SEO attributes in that I want to measure.
Let’s get started and work through how to combine the following tasks to achieve this, which can all be setup for free:
Querying with Google Custom Search Engine
Using the free Moz API account
Harvesting data with PHP and MySQL
Analyzing data with SQL and R
Querying with Google Custom Search Engine
We first need to query Google and get some results stored. To stay on the right side of Google’s terms of service, we’ll not be scraping Google.com directly but will instead use Google’s Custom Search feature. Google’s Custom Search is designed mainly to let website owners provide a Google like search widget on their website. However, there is also a REST based Google Search API that is free and lets you query Google and retrieve results in the popular JSON format. There are quota limits but these can be configured and extended to provide a good sample of data to work with.
When configured correctly to search the entire web, you can send queries to your Custom Search Engine, in our case using PHP, and treat them like Google responses, albeit with some caveats. The main limitations of using a Custom Search Engine are: (i) it doesn’t use some Google Web Search features such as personalized results and; (ii) it may have a subset of results from the Google index if you include more than ten sites.
Notwithstanding these limitations, there are many search options that can be passed to the Custom Search Engine to proxy what you might expect Google.com to return. In our scenario, we passed the following when making a call:
https://www.googleapis.com/customsearch/v1?key=<google_api_id>&userIp= <ip_address>&cx<custom_search_engine_id>&q=iPhone+X&cr=countryUS&start= 1</custom_search_engine_id></ip_address></google_api_id>
Where:
https://www.googleapis.com/customsearch/v1 – is the URL for the Google Custom Search API
key=<GOOGLE_API_ID> – Your Google Developer API Key
userIp=<IP_ADDRESS> – The IP address of the local machine making the call
cx=<CUSTOM_SEARCH_ENGINE_ID> – Your Google Custom Search Engine ID
q=iPhone+X – The Google query string (‘+’ replaces ‘ ‘)
cr=countryUS – Country restriction (from Goolge’s Country Collection Name list)
start=1 – The index of the first result to return – e.g. SERP page 1. Successive calls would increment this to get pages 2–5.
Google has said that the Google Custom Search engine differs from Google .com, but in my limited prod testing comparing results between the two, I was encouraged by the similarities and so continued with the analysis. That said, keep in mind that the data and results below come from Google Custom Search (using ‘whole web’ queries), not Google.com.
Using the free Moz API account
Moz provide an Application Programming Interface (API). To use it you will need to register for a Mozscape API key, which is free but limited to 2,500 rows per month and one query every ten seconds. Current paid plans give you increased quotas and start at $250/month. Having a free account and API key, you can then query the Links API and analyze the following metrics:
Moz data field
Moz API code
Description
ueid
32
The number of external equity links to the URL
uid
2048
The number of links (external, equity or nonequity or not,) to the URL
umrp**
16384
The MozRank of the URL, as a normalized 10-point score
umrr**
16384
The MozRank of the URL, as a raw score
fmrp**
32768
The MozRank of the URL's subdomain, as a normalized 10-point score
fmrr**
32768
The MozRank of the URL's subdomain, as a raw score
us
536870912
The HTTP status code recorded for this URL, if available
upa
34359738368
A normalized 100-point score representing the likelihood of a page to rank well in search engine results
pda
68719476736
A normalized 100-point score representing the likelihood of a domain to rank well in search engine results
NOTE: Since this analysis was captured, Moz documented that they have deprecated these fields. However, in testing this (15-06-2019), the fields were still present.
Moz API Codes are added together before calling the Links API with something that looks like the following:
www.apple.com%2F?Cols=103616137253&AccessID=MOZ_ACCESS_ID& Expires=1560586149&Signature=<MOZ_SECRET_KEY>
Where:
https://ift.tt/1bbWaai" class="redactor-autoparser-object">https://ift.tt/2oVcks4... – Is the URL for the Moz API
http%3A%2F%2Fwww.apple.com%2F – An encoded URL that we want to get data on
Cols=103616137253 – The sum of the Moz API codes from the table above
AccessID=MOZ_ACCESS_ID – An encoded version of the Moz Access ID (found in your API account)
Expires=1560586149 – A timeout for the query - set a few minutes into the future
Signature=<MOZ_SECRET_KEY> – An encoded version of the Moz Access ID (found in your API account)
Moz will return with something like the following JSON:
Array ( [ut] => Apple [uu] => <a href="http://www.apple.com/" class="redactor-autoparser-object">www.apple.com/</a> [ueid] => 13078035 [uid] => 14632963 [uu] => www.apple.com/ [ueid] => 13078035 [uid] => 14632963 [umrp] => 9 [umrr] => 0.8999999762 [fmrp] => 2.602215052 [fmrr] => 0.2602215111 [us] => 200 [upa] => 90 [pda] => 100 )
For a great starting point on querying Moz with PHP, Perl, Python, Ruby and Javascript, see this repository on Github. I chose to use PHP.
Harvesting data with PHP and MySQL
Now we have a Google Custom Search Engine and our Moz API, we’re almost ready to capture data. Google and Moz respond to requests via the JSON format and so can be queried by many popular programming languages. In addition to my chosen language, PHP, I wrote the results of both Google and Moz to a database and chose MySQL Community Edition for this. Other databases could be also used, e.g. Postgres, Oracle, Microsoft SQL Server etc. Doing so enables persistence of the data and ad-hoc analysis using SQL (Structured Query Language) as well as other languages (like R, which I will go over later). After creating database tables to hold the Google search results (with fields for rank, URL etc.) and a table to hold Moz data fields (ueid, upa, uda etc.), we’re ready to design our data harvesting plan.
Google provide a generous quota with the Custom Search Engine (up to 100M queries per day with the same Google developer console key) but the Moz free API is limited to 2,500. Though for Moz, paid for options provide between 120k and 40M rows per month depending on plans and range in cost from $250–$10,000/month. Therefore, as I’m just exploring the free option, I designed my code to harvest 125 Google queries over 2 pages of SERPs (10 results per page) allowing me to stay within the Moz 2,500 row quota. As for which searches to fire at Google, there are numerous resources to use from. I chose to use Mondovo as they provide numerous lists by category and up to 500 words per list which is ample for the experiment.
I also rolled in a few PHP helper classes alongside my own code for database I/O and HTTP.
In summary, the main PHP building blocks and sources used were:
Google Custom Search Engine – Ash Kiswany wrote an excellent article using Jacob Fogg’s PHP interface for Google Custom Search;
Mozscape API – As mentioned, this PHP implementation for accessing Moz on Github was a good starting point;
Website crawler and HTTP – At Purple Toolz, we have our own crawler called PurpleToolzBot which uses Curl for HTTP and this Simple HTML DOM Parser;
Database I/O – PHP has excellent support for MySQL which I wrapped into classes from these tutorials.
One factor to be aware of is the 10 second interval between Moz API calls. This is to prevent Moz being overloaded by free API users. To handle this in software, I wrote a "query throttler" which blocked access to the Moz API between successive calls within a timeframe. However, whilst working perfectly it meant that calling Moz 2,500 times in succession took just under 7 hours to complete.
Analyzing data with SQL and R
Data harvested. Now the fun begins!
It’s time to have a look at what we’ve got. This is sometimes called data wrangling. I use a free statistical programming language called R along with a development environment (editor) called R Studio. There are other languages such as Stata and more graphical data science tools like Tableau, but these cost and the finance director at Purple Toolz isn’t someone to cross!
I have been using R for a number of years because it’s open source and it has many third-party libraries, making it extremely versatile and appropriate for this kind of work.
Let’s roll up our sleeves.
I now have a couple of database tables with the results of my 125 search term queries across 2 pages of SERPS (i.e. 20 ranked URLs per search term). Two database tables hold the Google results and another table holds the Moz data results. To access these, we’ll need to do a database INNER JOIN which we can easily accomplish by using the RMySQL package with R. This is loaded by typing "install.packages('RMySQL')" into R’s console and including the line "library(RMySQL)" at the top of our R script.
We can then do the following to connect and get the data into an R data frame variable called "theResults."
library(RMySQL) # INNER JOIN the two tables theQuery <- " SELECT A.*, B.*, C.* FROM ( SELECT cseq_search_id FROM cse_query ) A -- Custom Search Query INNER JOIN ( SELECT cser_cseq_id, cser_rank, cser_url FROM cse_results ) B -- Custom Search Results ON A.cseq_search_id = B.cser_cseq_id INNER JOIN ( SELECT * FROM moz ) C -- Moz Data Fields ON B.cser_url = C.moz_url ; " # [1] Connect to the database # Replace USER_NAME with your database username # Replace PASSWORD with your database password # Replace MY_DB with your database name theConn <- dbConnect(dbDriver("MySQL"), user = "USER_NAME", password = "PASSWORD", dbname = "MY_DB") # [2] Query the database and hold the results theResults <- dbGetQuery(theConn, theQuery) # [3] Disconnect from the database dbDisconnect(theConn)
NOTE: I have two tables to hold the Google Custom Search Engine data. One holds data on the Google query (cse_query) and one holds results (cse_results).
We can now use R’s full range of statistical functions to begin wrangling.
Let’s start with some summaries to get a feel for the data. The process I go through is basically the same for each of the fields, so let’s illustrate and use Moz’s ‘UEID’ field (the number of external equity links to a URL). By typing the following into R I get the this:
> summary(theResults$moz_ueid) Min. 1st Qu. Median Mean 3rd Qu. Max. 0 1 20 14709 182 2755274 > quantile(theResults$moz_ueid, probs = c(1, 5, 10, 25, 50, 75, 80, 90, 95, 99, 100)/100) 1% 5% 10% 25% 50% 75% 80% 90% 95% 99% 100% 0.0 0.0 0.0 1.0 20.0 182.0 337.2 1715.2 7873.4 412283.4 2755274.0
Looking at this, you can see that the data is skewed (a lot) by the relationship of the median to the mean, which is being pulled by values in the upper quartile range (values beyond 75% of the observations). We can however, plot this as a box and whisker plot in R where each X value is the distribution of UEIDs by rank from Google Custom Search position 1-20.
Note we are using a log scale on the y-axis so that we can display the full range of values as they vary a lot!
A box and whisker plot in R of Moz’s UEID by Google rank (note: log scale)
Box and whisker plots are great as they show a lot of information in them (see the geom_boxplot function in R). The purple boxed area represents the Inter-Quartile Range (IQR) which are the values between 25% and 75% of observations. The horizontal line in each ‘box’ represents the median value (the one in the middle when ordered), whilst the lines extending from the box (called the ‘whiskers’) represent 1.5x IQR. Dots outside the whiskers are called ‘outliers’ and show where the extents of each rank’s set of observations are. Despite the log scale, we can see a noticeable pull-up from rank #10 to rank #1 in median values, indicating that the number of equity links might be a Google ranking factor. Let’s explore this further with density plots.
Density plots are a lot like distributions (histograms) but show smooth lines rather than bars for the data. Much like a histogram, a density plot’s peak shows where the data values are concentrated and can help when comparing two distributions. In the density plot below, I have split the data into two categories: (i) results that appeared on Page 1 of SERPs ranked 1-10 are in pink and; (ii) results that appeared on SERP Page 2 are in blue. I have also plotted the medians of both distributions to help illustrate the difference in results between Page 1 and Page 2.
The inference from these two density plots is that Page 1 SERP results had more external equity backlinks (UEIDs) on than Page 2 results. You can also see the median values for these two categories below which clearly shows how the value for Page 1 (38) is far greater than Page 2 (11). So we now have some numbers to base our SEO strategy for backlinks on.
# Create a factor in R according to which SERP page a result (cser_rank) is on > theResults$rankBin <- paste("Page", ceiling(theResults$cser_rank / 10)) > theResults$rankBin <- factor(theResults$rankBin) # Now report the medians by SERP page by calling ‘tapply’ > tapply(theResults$moz_ueid, theResults$rankBin, median) Page 1 Page 2 38 11
From this, we can deduce that equity backlinks (UEID) matter and if I were advising a client based on this data, I would say they should be looking to get over 38 equity-based backlinks to help them get to Page 1 of SERPs. Of course, this is a limited sample and more research, a bigger sample and other ranking factors would need to be considered, but you get the idea.
Now let’s investigate another metric that has less of a range on it than UEID and look at Moz’s UPA measure, which is the likelihood that a page will rank well in search engine results.
> summary(theResults$moz_upa) Min. 1st Qu. Median Mean 3rd Qu. Max. 1.00 33.00 41.00 41.22 50.00 81.00 > quantile(theResults$moz_upa, probs = c(1, 5, 10, 25, 50, 75, 80, 90, 95, 99, 100)/100) 1% 5% 10% 25% 50% 75% 80% 90% 95% 99% 100% 12 20 25 33 41 50 53 58 62 75 81
UPA is a number given to a URL and ranges between 0–100. The data is better behaved than the previous UEID unbounded variable having its mean and median close together making for a more ‘normal’ distribution as we can see below by plotting a histogram in R.
A histogram of Moz’s UPA score
We’ll do the same Page 1 : Page 2 split and density plot that we did before and look at the UPA score distributions when we divide the UPA data into two groups.
# Report the medians by SERP page by calling ‘tapply’ > tapply(theResults$moz_upa, theResults$rankBin, median) Page 1 Page 2 43 39
In summary, two very different distributions from two Moz API variables. But both showed differences in their scores between SERP pages and provide you with tangible values (medians) to work with and ultimately advise clients on or apply to your own SEO.
Of course, this is just a small sample and shouldn’t be taken literally. But with free resources from both Google and Moz, you can now see how you can begin to develop analytical capabilities of your own to base your assumptions on rather than accepting the norm. SEO ranking factors change all the time and having your own analytical tools to conduct your own tests and experiments on will help give you credibility and perhaps even a unique insight on something hitherto unknown.
Google provide you with a healthy free quota to obtain search results from. If you need more than the 2,500 rows/month Moz provide for free there are numerous paid-for plans you can purchase. MySQL is a free download and R is also a free package for statistical analysis (and much more).
Go explore!
Sign up for The Moz Top 10, a semimonthly mailer updating you on the top ten hottest pieces of SEO news, tips, and rad links uncovered by the Moz team. Think of it as your exclusive digest of stuff you don't have time to hunt down but want to read!
0 notes
Text
SEO Analytics for Free - Combining Google Search with the Moz API
Posted by Purple-Toolz
I’m a self-funded start-up business owner. As such, I want to get as much as I can for free before convincing our finance director to spend our hard-earned bootstrapping funds. I’m also an analyst with a background in data and computer science, so a bit of a geek by any definition.
What I try to do, with my SEO analyst hat on, is hunt down great sources of free data and wrangle it into something insightful. Why? Because there’s no value in basing client advice on conjecture. It’s far better to combine quality data with good analysis and help our clients better understand what’s important for them to focus on.
In this article, I will tell you how to get started using a few free resources and illustrate how to pull together unique analytics that provide useful insights for your blog articles if you’re a writer, your agency if you’re an SEO, or your website if you’re a client or owner doing SEO yourself.
The scenario I’m going to use is that I want analyze some SEO attributes (e.g. backlinks, Page Authority etc.) and look at their effect on Google ranking. I want to answer questions like “Do backlinks really matter in getting to Page 1 of SERPs?” and “What kind of Page Authority score do I really need to be in the top 10 results?” To do this, I will need to combine data from a number of Google searches with data on each result that has the SEO attributes in that I want to measure.
Let’s get started and work through how to combine the following tasks to achieve this, which can all be setup for free:
Querying with Google Custom Search Engine
Using the free Moz API account
Harvesting data with PHP and MySQL
Analyzing data with SQL and R
Querying with Google Custom Search Engine
We first need to query Google and get some results stored. To stay on the right side of Google’s terms of service, we’ll not be scraping Google.com directly but will instead use Google’s Custom Search feature. Google’s Custom Search is designed mainly to let website owners provide a Google like search widget on their website. However, there is also a REST based Google Search API that is free and lets you query Google and retrieve results in the popular JSON format. There are quota limits but these can be configured and extended to provide a good sample of data to work with.
When configured correctly to search the entire web, you can send queries to your Custom Search Engine, in our case using PHP, and treat them like Google responses, albeit with some caveats. The main limitations of using a Custom Search Engine are: (i) it doesn’t use some Google Web Search features such as personalized results and; (ii) it may have a subset of results from the Google index if you include more than ten sites.
Notwithstanding these limitations, there are many search options that can be passed to the Custom Search Engine to proxy what you might expect Google.com to return. In our scenario, we passed the following when making a call:
https://www.googleapis.com/customsearch/v1?key=<google_api_id>&userIp= <ip_address>&cx<custom_search_engine_id>&q=iPhone+X&cr=countryUS&start= 1</custom_search_engine_id></ip_address></google_api_id>
Where:
https://www.googleapis.com/customsearch/v1 – is the URL for the Google Custom Search API
key=<GOOGLE_API_ID> – Your Google Developer API Key
userIp=<IP_ADDRESS> – The IP address of the local machine making the call
cx=<CUSTOM_SEARCH_ENGINE_ID> – Your Google Custom Search Engine ID
q=iPhone+X – The Google query string (‘+’ replaces ‘ ‘)
cr=countryUS – Country restriction (from Goolge’s Country Collection Name list)
start=1 – The index of the first result to return – e.g. SERP page 1. Successive calls would increment this to get pages 2–5.
Google has said that the Google Custom Search engine differs from Google .com, but in my limited prod testing comparing results between the two, I was encouraged by the similarities and so continued with the analysis. That said, keep in mind that the data and results below come from Google Custom Search (using ‘whole web’ queries), not Google.com.
Using the free Moz API account
Moz provide an Application Programming Interface (API). To use it you will need to register for a Mozscape API key, which is free but limited to 2,500 rows per month and one query every ten seconds. Current paid plans give you increased quotas and start at $250/month. Having a free account and API key, you can then query the Links API and analyze the following metrics:
Moz data field
Moz API code
Description
ueid
32
The number of external equity links to the URL
uid
2048
The number of links (external, equity or nonequity or not,) to the URL
umrp**
16384
The MozRank of the URL, as a normalized 10-point score
umrr**
16384
The MozRank of the URL, as a raw score
fmrp**
32768
The MozRank of the URL's subdomain, as a normalized 10-point score
fmrr**
32768
The MozRank of the URL's subdomain, as a raw score
us
536870912
The HTTP status code recorded for this URL, if available
upa
34359738368
A normalized 100-point score representing the likelihood of a page to rank well in search engine results
pda
68719476736
A normalized 100-point score representing the likelihood of a domain to rank well in search engine results
NOTE: Since this analysis was captured, Moz documented that they have deprecated these fields. However, in testing this (15-06-2019), the fields were still present.
Moz API Codes are added together before calling the Links API with something that looks like the following:
www.apple.com%2F?Cols=103616137253&AccessID=MOZ_ACCESS_ID& Expires=1560586149&Signature=<MOZ_SECRET_KEY>
Where:
http://lsapi.seomoz.com/linkscape/url-metrics/" class="redactor-autoparser-object">http://lsapi.seomoz.com/linksc... – Is the URL for the Moz API
http%3A%2F%2Fwww.apple.com%2F – An encoded URL that we want to get data on
Cols=103616137253 – The sum of the Moz API codes from the table above
AccessID=MOZ_ACCESS_ID – An encoded version of the Moz Access ID (found in your API account)
Expires=1560586149 – A timeout for the query - set a few minutes into the future
Signature=<MOZ_SECRET_KEY> – An encoded version of the Moz Access ID (found in your API account)
Moz will return with something like the following JSON:
Array ( [ut] => Apple [uu] => <a href="http://www.apple.com/" class="redactor-autoparser-object">www.apple.com/</a> [ueid] => 13078035 [uid] => 14632963 [uu] => www.apple.com/ [ueid] => 13078035 [uid] => 14632963 [umrp] => 9 [umrr] => 0.8999999762 [fmrp] => 2.602215052 [fmrr] => 0.2602215111 [us] => 200 [upa] => 90 [pda] => 100 )
For a great starting point on querying Moz with PHP, Perl, Python, Ruby and Javascript, see this repository on Github. I chose to use PHP.
Harvesting data with PHP and MySQL
Now we have a Google Custom Search Engine and our Moz API, we’re almost ready to capture data. Google and Moz respond to requests via the JSON format and so can be queried by many popular programming languages. In addition to my chosen language, PHP, I wrote the results of both Google and Moz to a database and chose MySQL Community Edition for this. Other databases could be also used, e.g. Postgres, Oracle, Microsoft SQL Server etc. Doing so enables persistence of the data and ad-hoc analysis using SQL (Structured Query Language) as well as other languages (like R, which I will go over later). After creating database tables to hold the Google search results (with fields for rank, URL etc.) and a table to hold Moz data fields (ueid, upa, uda etc.), we’re ready to design our data harvesting plan.
Google provide a generous quota with the Custom Search Engine (up to 100M queries per day with the same Google developer console key) but the Moz free API is limited to 2,500. Though for Moz, paid for options provide between 120k and 40M rows per month depending on plans and range in cost from $250–$10,000/month. Therefore, as I’m just exploring the free option, I designed my code to harvest 125 Google queries over 2 pages of SERPs (10 results per page) allowing me to stay within the Moz 2,500 row quota. As for which searches to fire at Google, there are numerous resources to use from. I chose to use Mondovo as they provide numerous lists by category and up to 500 words per list which is ample for the experiment.
I also rolled in a few PHP helper classes alongside my own code for database I/O and HTTP.
In summary, the main PHP building blocks and sources used were:
Google Custom Search Engine – Ash Kiswany wrote an excellent article using Jacob Fogg’s PHP interface for Google Custom Search;
Mozscape API – As mentioned, this PHP implementation for accessing Moz on Github was a good starting point;
Website crawler and HTTP – At Purple Toolz, we have our own crawler called PurpleToolzBot which uses Curl for HTTP and this Simple HTML DOM Parser;
Database I/O – PHP has excellent support for MySQL which I wrapped into classes from these tutorials.
One factor to be aware of is the 10 second interval between Moz API calls. This is to prevent Moz being overloaded by free API users. To handle this in software, I wrote a "query throttler" which blocked access to the Moz API between successive calls within a timeframe. However, whilst working perfectly it meant that calling Moz 2,500 times in succession took just under 7 hours to complete.
Analyzing data with SQL and R
Data harvested. Now the fun begins!
It’s time to have a look at what we’ve got. This is sometimes called data wrangling. I use a free statistical programming language called R along with a development environment (editor) called R Studio. There are other languages such as Stata and more graphical data science tools like Tableau, but these cost and the finance director at Purple Toolz isn’t someone to cross!
I have been using R for a number of years because it’s open source and it has many third-party libraries, making it extremely versatile and appropriate for this kind of work.
Let’s roll up our sleeves.
I now have a couple of database tables with the results of my 125 search term queries across 2 pages of SERPS (i.e. 20 ranked URLs per search term). Two database tables hold the Google results and another table holds the Moz data results. To access these, we’ll need to do a database INNER JOIN which we can easily accomplish by using the RMySQL package with R. This is loaded by typing "install.packages('RMySQL')" into R’s console and including the line "library(RMySQL)" at the top of our R script.
We can then do the following to connect and get the data into an R data frame variable called "theResults."
library(RMySQL) # INNER JOIN the two tables theQuery <- " SELECT A.*, B.*, C.* FROM ( SELECT cseq_search_id FROM cse_query ) A -- Custom Search Query INNER JOIN ( SELECT cser_cseq_id, cser_rank, cser_url FROM cse_results ) B -- Custom Search Results ON A.cseq_search_id = B.cser_cseq_id INNER JOIN ( SELECT * FROM moz ) C -- Moz Data Fields ON B.cser_url = C.moz_url ; " # [1] Connect to the database # Replace USER_NAME with your database username # Replace PASSWORD with your database password # Replace MY_DB with your database name theConn <- dbConnect(dbDriver("MySQL"), user = "USER_NAME", password = "PASSWORD", dbname = "MY_DB") # [2] Query the database and hold the results theResults <- dbGetQuery(theConn, theQuery) # [3] Disconnect from the database dbDisconnect(theConn)
NOTE: I have two tables to hold the Google Custom Search Engine data. One holds data on the Google query (cse_query) and one holds results (cse_results).
We can now use R’s full range of statistical functions to begin wrangling.
Let’s start with some summaries to get a feel for the data. The process I go through is basically the same for each of the fields, so let’s illustrate and use Moz’s ‘UEID’ field (the number of external equity links to a URL). By typing the following into R I get the this:
> summary(theResults$moz_ueid) Min. 1st Qu. Median Mean 3rd Qu. Max. 0 1 20 14709 182 2755274 > quantile(theResults$moz_ueid, probs = c(1, 5, 10, 25, 50, 75, 80, 90, 95, 99, 100)/100) 1% 5% 10% 25% 50% 75% 80% 90% 95% 99% 100% 0.0 0.0 0.0 1.0 20.0 182.0 337.2 1715.2 7873.4 412283.4 2755274.0
Looking at this, you can see that the data is skewed (a lot) by the relationship of the median to the mean, which is being pulled by values in the upper quartile range (values beyond 75% of the observations). We can however, plot this as a box and whisker plot in R where each X value is the distribution of UEIDs by rank from Google Custom Search position 1-20.
Note we are using a log scale on the y-axis so that we can display the full range of values as they vary a lot!
A box and whisker plot in R of Moz’s UEID by Google rank (note: log scale)
Box and whisker plots are great as they show a lot of information in them (see the geom_boxplot function in R). The purple boxed area represents the Inter-Quartile Range (IQR) which are the values between 25% and 75% of observations. The horizontal line in each ‘box’ represents the median value (the one in the middle when ordered), whilst the lines extending from the box (called the ‘whiskers’) represent 1.5x IQR. Dots outside the whiskers are called ‘outliers’ and show where the extents of each rank’s set of observations are. Despite the log scale, we can see a noticeable pull-up from rank #10 to rank #1 in median values, indicating that the number of equity links might be a Google ranking factor. Let’s explore this further with density plots.
Density plots are a lot like distributions (histograms) but show smooth lines rather than bars for the data. Much like a histogram, a density plot’s peak shows where the data values are concentrated and can help when comparing two distributions. In the density plot below, I have split the data into two categories: (i) results that appeared on Page 1 of SERPs ranked 1-10 are in pink and; (ii) results that appeared on SERP Page 2 are in blue. I have also plotted the medians of both distributions to help illustrate the difference in results between Page 1 and Page 2.
The inference from these two density plots is that Page 1 SERP results had more external equity backlinks (UEIDs) on than Page 2 results. You can also see the median values for these two categories below which clearly shows how the value for Page 1 (38) is far greater than Page 2 (11). So we now have some numbers to base our SEO strategy for backlinks on.
# Create a factor in R according to which SERP page a result (cser_rank) is on > theResults$rankBin <- paste("Page", ceiling(theResults$cser_rank / 10)) > theResults$rankBin <- factor(theResults$rankBin) # Now report the medians by SERP page by calling ‘tapply’ > tapply(theResults$moz_ueid, theResults$rankBin, median) Page 1 Page 2 38 11
From this, we can deduce that equity backlinks (UEID) matter and if I were advising a client based on this data, I would say they should be looking to get over 38 equity-based backlinks to help them get to Page 1 of SERPs. Of course, this is a limited sample and more research, a bigger sample and other ranking factors would need to be considered, but you get the idea.
Now let’s investigate another metric that has less of a range on it than UEID and look at Moz’s UPA measure, which is the likelihood that a page will rank well in search engine results.
> summary(theResults$moz_upa) Min. 1st Qu. Median Mean 3rd Qu. Max. 1.00 33.00 41.00 41.22 50.00 81.00 > quantile(theResults$moz_upa, probs = c(1, 5, 10, 25, 50, 75, 80, 90, 95, 99, 100)/100) 1% 5% 10% 25% 50% 75% 80% 90% 95% 99% 100% 12 20 25 33 41 50 53 58 62 75 81
UPA is a number given to a URL and ranges between 0–100. The data is better behaved than the previous UEID unbounded variable having its mean and median close together making for a more ‘normal’ distribution as we can see below by plotting a histogram in R.
A histogram of Moz’s UPA score
We’ll do the same Page 1 : Page 2 split and density plot that we did before and look at the UPA score distributions when we divide the UPA data into two groups.
# Report the medians by SERP page by calling ‘tapply’ > tapply(theResults$moz_upa, theResults$rankBin, median) Page 1 Page 2 43 39
In summary, two very different distributions from two Moz API variables. But both showed differences in their scores between SERP pages and provide you with tangible values (medians) to work with and ultimately advise clients on or apply to your own SEO.
Of course, this is just a small sample and shouldn’t be taken literally. But with free resources from both Google and Moz, you can now see how you can begin to develop analytical capabilities of your own to base your assumptions on rather than accepting the norm. SEO ranking factors change all the time and having your own analytical tools to conduct your own tests and experiments on will help give you credibility and perhaps even a unique insight on something hitherto unknown.
Google provide you with a healthy free quota to obtain search results from. If you need more than the 2,500 rows/month Moz provide for free there are numerous paid-for plans you can purchase. MySQL is a free download and R is also a free package for statistical analysis (and much more).
Go explore!
Sign up for The Moz Top 10, a semimonthly mailer updating you on the top ten hottest pieces of SEO news, tips, and rad links uncovered by the Moz team. Think of it as your exclusive digest of stuff you don't have time to hunt down but want to read!
0 notes
Text
SEO Analytics for Free - Combining Google Search with the Moz API
Posted by Purple-Toolz
I’m a self-funded start-up business owner. As such, I want to get as much as I can for free before convincing our finance director to spend our hard-earned bootstrapping funds. I’m also an analyst with a background in data and computer science, so a bit of a geek by any definition.
What I try to do, with my SEO analyst hat on, is hunt down great sources of free data and wrangle it into something insightful. Why? Because there’s no value in basing client advice on conjecture. It’s far better to combine quality data with good analysis and help our clients better understand what’s important for them to focus on.
In this article, I will tell you how to get started using a few free resources and illustrate how to pull together unique analytics that provide useful insights for your blog articles if you’re a writer, your agency if you’re an SEO, or your website if you’re a client or owner doing SEO yourself.
The scenario I’m going to use is that I want analyze some SEO attributes (e.g. backlinks, Page Authority etc.) and look at their effect on Google ranking. I want to answer questions like “Do backlinks really matter in getting to Page 1 of SERPs?” and “What kind of Page Authority score do I really need to be in the top 10 results?” To do this, I will need to combine data from a number of Google searches with data on each result that has the SEO attributes in that I want to measure.
Let’s get started and work through how to combine the following tasks to achieve this, which can all be setup for free:
Querying with Google Custom Search Engine
Using the free Moz API account
Harvesting data with PHP and MySQL
Analyzing data with SQL and R
Querying with Google Custom Search Engine
We first need to query Google and get some results stored. To stay on the right side of Google’s terms of service, we’ll not be scraping Google.com directly but will instead use Google’s Custom Search feature. Google’s Custom Search is designed mainly to let website owners provide a Google like search widget on their website. However, there is also a REST based Google Search API that is free and lets you query Google and retrieve results in the popular JSON format. There are quota limits but these can be configured and extended to provide a good sample of data to work with.
When configured correctly to search the entire web, you can send queries to your Custom Search Engine, in our case using PHP, and treat them like Google responses, albeit with some caveats. The main limitations of using a Custom Search Engine are: (i) it doesn’t use some Google Web Search features such as personalized results and; (ii) it may have a subset of results from the Google index if you include more than ten sites.
Notwithstanding these limitations, there are many search options that can be passed to the Custom Search Engine to proxy what you might expect Google.com to return. In our scenario, we passed the following when making a call:
https://www.googleapis.com/customsearch/v1?key=<google_api_id>&userIp= <ip_address>&cx<custom_search_engine_id>&q=iPhone+X&cr=countryUS&start= 1</custom_search_engine_id></ip_address></google_api_id>
Where:
https://www.googleapis.com/customsearch/v1 – is the URL for the Google Custom Search API
key=<GOOGLE_API_ID> – Your Google Developer API Key
userIp=<IP_ADDRESS> – The IP address of the local machine making the call
cx=<CUSTOM_SEARCH_ENGINE_ID> – Your Google Custom Search Engine ID
q=iPhone+X – The Google query string (‘+’ replaces ‘ ‘)
cr=countryUS – Country restriction (from Goolge’s Country Collection Name list)
start=1 – The index of the first result to return – e.g. SERP page 1. Successive calls would increment this to get pages 2–5.
Google has said that the Google Custom Search engine differs from Google .com, but in my limited prod testing comparing results between the two, I was encouraged by the similarities and so continued with the analysis. That said, keep in mind that the data and results below come from Google Custom Search (using ‘whole web’ queries), not Google.com.
Using the free Moz API account
Moz provide an Application Programming Interface (API). To use it you will need to register for a Mozscape API key, which is free but limited to 2,500 rows per month and one query every ten seconds. Current paid plans give you increased quotas and start at $250/month. Having a free account and API key, you can then query the Links API and analyze the following metrics:
Moz data field
Moz API code
Description
ueid
32
The number of external equity links to the URL
uid
2048
The number of links (external, equity or nonequity or not,) to the URL
umrp**
16384
The MozRank of the URL, as a normalized 10-point score
umrr**
16384
The MozRank of the URL, as a raw score
fmrp**
32768
The MozRank of the URL's subdomain, as a normalized 10-point score
fmrr**
32768
The MozRank of the URL's subdomain, as a raw score
us
536870912
The HTTP status code recorded for this URL, if available
upa
34359738368
A normalized 100-point score representing the likelihood of a page to rank well in search engine results
pda
68719476736
A normalized 100-point score representing the likelihood of a domain to rank well in search engine results
NOTE: Since this analysis was captured, Moz documented that they have deprecated these fields. However, in testing this (15-06-2019), the fields were still present.
Moz API Codes are added together before calling the Links API with something that looks like the following:
www.apple.com%2F?Cols=103616137253&AccessID=MOZ_ACCESS_ID& Expires=1560586149&Signature=<MOZ_SECRET_KEY>
Where:
https://ift.tt/1bbWaai" class="redactor-autoparser-object">https://ift.tt/2oVcks4... – Is the URL for the Moz API
http%3A%2F%2Fwww.apple.com%2F – An encoded URL that we want to get data on
Cols=103616137253 – The sum of the Moz API codes from the table above
AccessID=MOZ_ACCESS_ID – An encoded version of the Moz Access ID (found in your API account)
Expires=1560586149 – A timeout for the query - set a few minutes into the future
Signature=<MOZ_SECRET_KEY> – An encoded version of the Moz Access ID (found in your API account)
Moz will return with something like the following JSON:
Array ( [ut] => Apple [uu] => <a href="http://www.apple.com/" class="redactor-autoparser-object">www.apple.com/</a> [ueid] => 13078035 [uid] => 14632963 [uu] => www.apple.com/ [ueid] => 13078035 [uid] => 14632963 [umrp] => 9 [umrr] => 0.8999999762 [fmrp] => 2.602215052 [fmrr] => 0.2602215111 [us] => 200 [upa] => 90 [pda] => 100 )
For a great starting point on querying Moz with PHP, Perl, Python, Ruby and Javascript, see this repository on Github. I chose to use PHP.
Harvesting data with PHP and MySQL
Now we have a Google Custom Search Engine and our Moz API, we’re almost ready to capture data. Google and Moz respond to requests via the JSON format and so can be queried by many popular programming languages. In addition to my chosen language, PHP, I wrote the results of both Google and Moz to a database and chose MySQL Community Edition for this. Other databases could be also used, e.g. Postgres, Oracle, Microsoft SQL Server etc. Doing so enables persistence of the data and ad-hoc analysis using SQL (Structured Query Language) as well as other languages (like R, which I will go over later). After creating database tables to hold the Google search results (with fields for rank, URL etc.) and a table to hold Moz data fields (ueid, upa, uda etc.), we’re ready to design our data harvesting plan.
Google provide a generous quota with the Custom Search Engine (up to 100M queries per day with the same Google developer console key) but the Moz free API is limited to 2,500. Though for Moz, paid for options provide between 120k and 40M rows per month depending on plans and range in cost from $250–$10,000/month. Therefore, as I’m just exploring the free option, I designed my code to harvest 125 Google queries over 2 pages of SERPs (10 results per page) allowing me to stay within the Moz 2,500 row quota. As for which searches to fire at Google, there are numerous resources to use from. I chose to use Mondovo as they provide numerous lists by category and up to 500 words per list which is ample for the experiment.
I also rolled in a few PHP helper classes alongside my own code for database I/O and HTTP.
In summary, the main PHP building blocks and sources used were:
Google Custom Search Engine – Ash Kiswany wrote an excellent article using Jacob Fogg’s PHP interface for Google Custom Search;
Mozscape API – As mentioned, this PHP implementation for accessing Moz on Github was a good starting point;
Website crawler and HTTP – At Purple Toolz, we have our own crawler called PurpleToolzBot which uses Curl for HTTP and this Simple HTML DOM Parser;
Database I/O – PHP has excellent support for MySQL which I wrapped into classes from these tutorials.
One factor to be aware of is the 10 second interval between Moz API calls. This is to prevent Moz being overloaded by free API users. To handle this in software, I wrote a "query throttler" which blocked access to the Moz API between successive calls within a timeframe. However, whilst working perfectly it meant that calling Moz 2,500 times in succession took just under 7 hours to complete.
Analyzing data with SQL and R
Data harvested. Now the fun begins!
It’s time to have a look at what we’ve got. This is sometimes called data wrangling. I use a free statistical programming language called R along with a development environment (editor) called R Studio. There are other languages such as Stata and more graphical data science tools like Tableau, but these cost and the finance director at Purple Toolz isn’t someone to cross!
I have been using R for a number of years because it’s open source and it has many third-party libraries, making it extremely versatile and appropriate for this kind of work.
Let’s roll up our sleeves.
I now have a couple of database tables with the results of my 125 search term queries across 2 pages of SERPS (i.e. 20 ranked URLs per search term). Two database tables hold the Google results and another table holds the Moz data results. To access these, we’ll need to do a database INNER JOIN which we can easily accomplish by using the RMySQL package with R. This is loaded by typing "install.packages('RMySQL')" into R’s console and including the line "library(RMySQL)" at the top of our R script.
We can then do the following to connect and get the data into an R data frame variable called "theResults."
library(RMySQL) # INNER JOIN the two tables theQuery <- " SELECT A.*, B.*, C.* FROM ( SELECT cseq_search_id FROM cse_query ) A -- Custom Search Query INNER JOIN ( SELECT cser_cseq_id, cser_rank, cser_url FROM cse_results ) B -- Custom Search Results ON A.cseq_search_id = B.cser_cseq_id INNER JOIN ( SELECT * FROM moz ) C -- Moz Data Fields ON B.cser_url = C.moz_url ; " # [1] Connect to the database # Replace USER_NAME with your database username # Replace PASSWORD with your database password # Replace MY_DB with your database name theConn <- dbConnect(dbDriver("MySQL"), user = "USER_NAME", password = "PASSWORD", dbname = "MY_DB") # [2] Query the database and hold the results theResults <- dbGetQuery(theConn, theQuery) # [3] Disconnect from the database dbDisconnect(theConn)
NOTE: I have two tables to hold the Google Custom Search Engine data. One holds data on the Google query (cse_query) and one holds results (cse_results).
We can now use R’s full range of statistical functions to begin wrangling.
Let’s start with some summaries to get a feel for the data. The process I go through is basically the same for each of the fields, so let’s illustrate and use Moz’s ‘UEID’ field (the number of external equity links to a URL). By typing the following into R I get the this:
> summary(theResults$moz_ueid) Min. 1st Qu. Median Mean 3rd Qu. Max. 0 1 20 14709 182 2755274 > quantile(theResults$moz_ueid, probs = c(1, 5, 10, 25, 50, 75, 80, 90, 95, 99, 100)/100) 1% 5% 10% 25% 50% 75% 80% 90% 95% 99% 100% 0.0 0.0 0.0 1.0 20.0 182.0 337.2 1715.2 7873.4 412283.4 2755274.0
Looking at this, you can see that the data is skewed (a lot) by the relationship of the median to the mean, which is being pulled by values in the upper quartile range (values beyond 75% of the observations). We can however, plot this as a box and whisker plot in R where each X value is the distribution of UEIDs by rank from Google Custom Search position 1-20.
Note we are using a log scale on the y-axis so that we can display the full range of values as they vary a lot!
A box and whisker plot in R of Moz’s UEID by Google rank (note: log scale)
Box and whisker plots are great as they show a lot of information in them (see the geom_boxplot function in R). The purple boxed area represents the Inter-Quartile Range (IQR) which are the values between 25% and 75% of observations. The horizontal line in each ‘box’ represents the median value (the one in the middle when ordered), whilst the lines extending from the box (called the ‘whiskers’) represent 1.5x IQR. Dots outside the whiskers are called ‘outliers’ and show where the extents of each rank’s set of observations are. Despite the log scale, we can see a noticeable pull-up from rank #10 to rank #1 in median values, indicating that the number of equity links might be a Google ranking factor. Let’s explore this further with density plots.
Density plots are a lot like distributions (histograms) but show smooth lines rather than bars for the data. Much like a histogram, a density plot’s peak shows where the data values are concentrated and can help when comparing two distributions. In the density plot below, I have split the data into two categories: (i) results that appeared on Page 1 of SERPs ranked 1-10 are in pink and; (ii) results that appeared on SERP Page 2 are in blue. I have also plotted the medians of both distributions to help illustrate the difference in results between Page 1 and Page 2.
The inference from these two density plots is that Page 1 SERP results had more external equity backlinks (UEIDs) on than Page 2 results. You can also see the median values for these two categories below which clearly shows how the value for Page 1 (38) is far greater than Page 2 (11). So we now have some numbers to base our SEO strategy for backlinks on.
# Create a factor in R according to which SERP page a result (cser_rank) is on > theResults$rankBin <- paste("Page", ceiling(theResults$cser_rank / 10)) > theResults$rankBin <- factor(theResults$rankBin) # Now report the medians by SERP page by calling ‘tapply’ > tapply(theResults$moz_ueid, theResults$rankBin, median) Page 1 Page 2 38 11
From this, we can deduce that equity backlinks (UEID) matter and if I were advising a client based on this data, I would say they should be looking to get over 38 equity-based backlinks to help them get to Page 1 of SERPs. Of course, this is a limited sample and more research, a bigger sample and other ranking factors would need to be considered, but you get the idea.
Now let’s investigate another metric that has less of a range on it than UEID and look at Moz’s UPA measure, which is the likelihood that a page will rank well in search engine results.
> summary(theResults$moz_upa) Min. 1st Qu. Median Mean 3rd Qu. Max. 1.00 33.00 41.00 41.22 50.00 81.00 > quantile(theResults$moz_upa, probs = c(1, 5, 10, 25, 50, 75, 80, 90, 95, 99, 100)/100) 1% 5% 10% 25% 50% 75% 80% 90% 95% 99% 100% 12 20 25 33 41 50 53 58 62 75 81
UPA is a number given to a URL and ranges between 0–100. The data is better behaved than the previous UEID unbounded variable having its mean and median close together making for a more ‘normal’ distribution as we can see below by plotting a histogram in R.
A histogram of Moz’s UPA score
We’ll do the same Page 1 : Page 2 split and density plot that we did before and look at the UPA score distributions when we divide the UPA data into two groups.
# Report the medians by SERP page by calling ‘tapply’ > tapply(theResults$moz_upa, theResults$rankBin, median) Page 1 Page 2 43 39
In summary, two very different distributions from two Moz API variables. But both showed differences in their scores between SERP pages and provide you with tangible values (medians) to work with and ultimately advise clients on or apply to your own SEO.
Of course, this is just a small sample and shouldn’t be taken literally. But with free resources from both Google and Moz, you can now see how you can begin to develop analytical capabilities of your own to base your assumptions on rather than accepting the norm. SEO ranking factors change all the time and having your own analytical tools to conduct your own tests and experiments on will help give you credibility and perhaps even a unique insight on something hitherto unknown.
Google provide you with a healthy free quota to obtain search results from. If you need more than the 2,500 rows/month Moz provide for free there are numerous paid-for plans you can purchase. MySQL is a free download and R is also a free package for statistical analysis (and much more).
Go explore!
Sign up for The Moz Top 10, a semimonthly mailer updating you on the top ten hottest pieces of SEO news, tips, and rad links uncovered by the Moz team. Think of it as your exclusive digest of stuff you don't have time to hunt down but want to read!
0 notes
Text
Excel Not Grouping Dates in Filters? How to Fix It!
Filters (or AutoFilters) are very powerful in Excel: Not only allow them the basic filtering, but also sorting, filtering by colors and much more. One aspect about the filters I like the most: Gaining a quick overview of the data in the column below. When it comes to dates, filters group dates by year, month and day. That’s very helpful for gaining an overview of the date ranges as well as quickly applying filters. But what, if the filter has suddenly stopped to grouping dates? Here are possible reasons and how to fix it!
Initial comments to the grouping dates issues in filters
A few comment before we start:
By default, dates are grouped by year, month and day in Excel.
The following reasons are listed in the order of how simple and fast it is to apply them. Typically, the last reason is most probably, but before we start a larger investigation, let’s make sure that the “quick fixes” are finished.
If some dates are grouped and some are not in your filters, jump right to reason 3 below.
If you want to know more about filters, please refer to this article.
Reason 1: Grouping dates in filters is disabled
The easiest thing to check if is grouping dates is activated within the Excel options.
First step: Grouping dates in Excel options activated?
In Excel, go to File.
Click on Options (usually in the left bottom corner of the screen).
Go to the Advanced tab in the left pane of the Options window).
Scroll down to the workbook settings and set the check at “Group dates in the AutoFilter menu”.
Back in your filter, is the grouping working now?
(adsbygoogle = window.adsbygoogle || []).push({}); (function() { var done = false; var script = document.createElement('script'); script.async = true; script.type = 'text/javascript'; script.src = '//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js'; var createScript = setTimeout( function(){ document.getElementsByTagName('HEAD').item(0).appendChild(script); }, 5000 ); script.onreadystatechange = script.onload = function(e) { if (!done && (!this.readyState || this.readyState == 'loaded' || this.readyState == 'complete')) { (adsbygoogle = window.adsbygoogle || []).push({}); } }; })();
Reason 2: Your filter is not covering all rows to group dates
This second option is more like a quick side note: Make sure that your filter covers all items in your list or table.
To be sure, why don’t you remove it and insert it again? Here is how to do that.
Hold on a second. Was this information helpful so far?
Connect with me:
Boost your Excel skills: Learn the best Excel tricks and tutorials!
Reason 3: Excel does not recognize the dates as dates
Unfortunately, to my experience this is the most common reasons when Excel does not group dates in filters any longer. Excel does not recognize the dates as dates.
Dates recognized as text and not dates.
You could check this first: What does the summary in the status bar show when you select multiple (more than one, two are enough) date cells?
If it only say “Count: 2” it’s most probably a text cell (see screenshot on the right side). In this case you have to convert the cells to dates. Here is how to do that (see the section of forcing text to number formats or the universal format at the end).
These two selected cells are correctly recognized as date cells.
If the status bar also shows a summary of the dates (with at least one of the following values: minimum, maximum, sum, numerical count or average), both selected cell are recognized as dates. In this case you should not have any problems (at least with these two cells) and check reasons 1 and 2 above.
Image by Sa Ka from Pixabay
Der Beitrag Excel Not Grouping Dates in Filters? How to Fix It! erschien zuerst auf Professor Excel.
from Professor Excel https://ift.tt/39ipRNT
0 notes
Text
Multi-Thumb Sliders: General Case
The first part of this two-part series detailed how we can get a two-thumb slider. Now we'll look at a general multi-thumb case, but with a different and better technique for creating the fills in between the thumbs. And finally, we'll dive into the how behind the styling a realistic 3D-looking slider and a flat one.
Article Series:
Multi-Thumb Sliders: Particular Two-Thumb Case
Multi-Thumb Sliders: General Case (This Post)
A better, more flexible approach
Let's say that, on a wrapper pseudo-element that covers the same area as the range inputs, we stack left-to--right linear-gradient() layers corresponding to each thumb. Each gradient layer is fully opaque (i.e. the alpha is 1) from the track minimum up to the thumb's mid-line, after which it's fully transparent (i.e. the alpha is 0).
Note that the RGB values don't matter because all we care about are the alpha values. I personally use the red (for the fully opaque part) and transparent keywords in the code because they do the job with the least amount of characters.
How do we compute the gradient stop positions where we go from fully opaque to fully transparent? Well, these positions are always situated between a thumb radius from the left edge and a thumb radius from the right edge, so they are within a range that's equal to the useful width (the track width, minus the thumb diameter).
This means we first add a thumb radius.Then we compute the progress by dividing the difference between the current thumb's position and the minimum to the difference (--dif) between the maximum and the minimum. This progress value is a number in the [0, 1] interval — that's 0 when the current thumb position is at the slider's minimum, and 1 when the current thumb position is at the slider's maximum. To get where exactly along that useful width interval we are, we multiply this progress value with the useful width.
The position we're after is the sum between these two length values: the thumb radius and how far we are across the useful width interval.
The demo below allows us to see how everything looks stacked up in the 2D view and how exactly the range inputs and the gradients on their parent's pseudo-element get layered in the 3D view. It's also interactive, so we can drag the slider thumbs and see how the corresponding fill (which is created by a gradient layer on its parent's pseudo-element) changes.
See the Pen by thebabydino (@thebabydino) on CodePen.
The demo is best viewed in Chrome and Firefox.
Alright, but simply stacking these gradient layers doesn't give us the result we're after.
The solution here is to make these gradients mask layers and then XOR them (more precisely, in the case of CSS masks, this means to XOR their alphas).
If you need a refresher on how XOR works, here's one: given two inputs, the output of this operation is 1 if the input values are different (one of them is 1 and the other one is 0) and 0 if the input values are identical (both of them are 0 or both of them are 1)
The truth table for the XOR operation looks as follows:
Inputs Output A B 0 0 0 0 1 1 1 0 1 1 1 0
You can also play with it in the following interactive demo, where you can toggle the input values and see how the output changes:
See the Pen by thebabydino (@thebabydino) on CodePen.
In our case, the input values are the alphas of the gradient mask layers along the horizontal axis. XOR-ing multiple layers means doing so for the first two from the bottom, then XOR-ing the third from the bottom with the result of the previous XOR operation and so on. For our particular case of left-to-right gradients with an alpha equal to 1 up to a point (decided by the corresponding thumb value) and then 0, it looks as illustrated below (we start from the bottom and work our way up):
How we XOR the gradient layer alphas (Demo).
Where both layers from the bottom have an alpha of 1, the resulting layer we get after XOR-ing them has an alpha of 0. Where they have different alpha values, the resulting layer has an alpha of 1. Where they both have an alpha of 0, the resulting layer has an alpha of 0.
Moving up, we XOR the third layer with the resulting layer we got at the previous step. Where both these layers have the same alpha, the alpha of the layer that results from this second XOR operation is 0. Where they have different alphas, the resulting alpha is 1.
Similarly, we then XOR the fourth layer from the bottom with the layer resulting from the second stage XOR operation.
In terms of CSS, this means using the exclude value for the standard mask-composite and the xor value for the non-standard -webkit-mask-composite. (For a better understanding of mask compositing, check out the crash course.)
This technique gives us exactly the result we want while also allowing us to use a single pseudo-element for all the fills. It's also a technique that works for any number of thumbs. Let's see how we can put it into code!
In order to keep things fully flexible, we start by altering the Pug code such that it allows to add or remove a thumb and update everything else accordingly by simply adding or removing an item from an array of thumb objects, where every object contains a value and a label (which will be only for screen readers):
- let min = -50, max = 50; - let thumbs = [ - { val: -15, lbl: 'Value A' }, - { val: 20, lbl: 'Value B' }, - { val: -35, lbl: 'Value C' }, - { val: 45, lbl: 'Value D' } - ]; - let nv = thumbs.length; .wrap(role='group' aria-labelledby='multi-lbl' style=`${thumbs.map((c, i) => `--v${i}: ${c.val}`).join('; ')}; --min: ${min}; --max: ${max}`) #multi-lbl Multi thumb slider: - for(let i = 0; i < nv; i++) label.sr-only(for=`v${i}`) #{thumbs[i].lbl} input(type='range' id=`v${i}` min=min value=thumbs[i].val max=max) output(for=`v${i}` style=`--c: var(--v${i})`)
In the particular case of these exact four values, the generated markup looks as follows:
<div class='wrap' role='group' aria-labelledby='multi-lbl' style='--v0: -15; --v1: 20; --v2: -35; --v3: 45; --min: -50; --max: 50'> <div id='multi-lbl'>Multi thumb slider:</div> <label class='sr-only' for='v0'>Value A</label> <input type='range' id='v0' min='-50' value='-15' max='50'/> <output for='v0' style='--c: var(--v0)'></output> <label class='sr-only' for='v1'>Value B</label> <input type='range' id='v1' min='-50' value='20' max='50'/> <output for='v1' style='--c: var(--v1)'></output> <label class='sr-only' for='v2'>Value C</label> <input type='range' id='v2' min='-50' value='-35' max='50'/> <output for='v2' style='--c: var(--v2)'></output> <label class='sr-only' for='v3'>Value D</label> <input type='range' id='v3' min='-50' value='45' max='50'/> <output for='v3' style='--c: var(--v3)'></output> </div>
We don't need to add anything to the CSS or the JavaScript for this to give us a functional slider where the <output> values get updated as we drag the sliders. However, having four <output> elements while the wrapper's grid still has two columns would break the layout. So, for now, we remove the row introduced for the <output> elements, position these elements absolutely and only make them visible when the corresponding <input> is focused. We also remove the remains of the previous solution that uses both pseudo-elements on the wrapper.
.wrap { /* same as before */ grid-template-rows: max-content #{$h}; /* only 2 rows now */ &::after { background: #95a; // content: ''; // don't display for now grid-column: 1/ span 2; grid-row: 3; } } input[type='range'] { /* same as before */ grid-row: 2; /* last row is second row now */ } output { color: transparent; position: absolute; right: 0; &::after { content: counter(c); counter-reset: c var(--c); } }
We'll be doing more to prettify the result later, but for now, here's what we have:
See the Pen by thebabydino (@thebabydino) on CodePen.
Next, we need to get those thumb to thumb fills. We do this by generating the mask layers in the Pug and putting them in a --fill custom property on the wrapper.
//- same as before - let layers = thumbs.map((c, i) => `linear-gradient(90deg, red calc(var(--r) + (var(--v${i}) - var(--min))/var(--dif)*var(--uw)), transparent 0)`); .wrap(role='group' aria-labelledby='multi-lbl' style=`${thumbs.map((c, i) => `--v${i}: ${c.val}`).join('; ')}; --min: ${min}; --max: ${max}; --fill: ${layers.join(', ')}`) // - same as before
The generated HTML for the particular case of four thumbs with these values can be seen below. Note that this gets altered automatically if we add or remove items from the initial array:
<div class='wrap' role='group' aria-labelledby='multi-lbl' style='--v0: -15; --v1: 20; --v2: -35; --v3: 45; --min: -50; --max: 50; --fill: linear-gradient(90deg, red calc(var(--r) + (var(--v0) - var(--min))/var(--dif)*var(--uw)), transparent 0), linear-gradient(90deg, red calc(var(--r) + (var(--v1) - var(--min))/var(--dif)*var(--uw)), transparent 0), linear-gradient(90deg, red calc(var(--r) + (var(--v2) - var(--min))/var(--dif)*var(--uw)), transparent 0), linear-gradient(90deg, red calc(var(--r) + (var(--v3) - var(--min))/var(--dif)*var(--uw)), transparent 0)'> <div id='multi-lbl'>Multi thumb slider:</div> <label class='sr-only' for='v0'>Value A</label> <input type='range' id='v0' min='-50' value='-15' max='50'/> <output for='v0' style='--c: var(--v0)'></output> <label class='sr-only' for='v1'>Value B</label> <input type='range' id='v1' min='-50' value='20' max='50'/> <output for='v1' style='--c: var(--v1)'></output> <label class='sr-only' for='v2'>Value C</label> <input type='range' id='v2' min='-50' value='-35' max='50'/> <output for='v2' style='--c: var(--v2)'></output> <label class='sr-only' for='v3'>Value D</label> <input type='range' id='v3' min='-50' value='45' max='50'/> <output for='v3' style='--c: var(--v3)'></output> </div>
Note that this means we need to turn the Sass variables relating to dimensions into CSS variables and replace the Sass variables in the properties that use them:
.wrap { /* same as before */ --w: 20em; --h: 4em; --d: calc(.5*var(--h)); --r: calc(.5*var(--d)); --uw: calc(var(--w) - var(--d)); background: linear-gradient(0deg, #ccc var(--h), transparent 0); grid-template: max-content var(--h)/ var(--w); width: var(--w); }
We set our mask Oo the wrapper's ::after pseudo-element:
.wrap { /* same as before */ &::after { content: ''; background: #95a; grid-column: 1/ span 2; grid-row: 2; /* non-standard WebKit version */ -webkit-mask: var(--fill); -webkit-mask-composite: xor; /* standard version, supported in Firefox */ mask: var(--fill); mask-composite: exclude; } }
Now we have exactly what we want and the really cool thing about this technique is that all we need to do to change the number of thumbs is add or remove thumb objects (with a value and a label for each) to the thumbs array in the Pug code — absolutely nothing else needs to change!
See the Pen by thebabydino (@thebabydino) on CodePen.
Prettifying tweaks
What we have so far is anything but a pretty sight. So let's start fixing that!
Option #1: a realistic look
Let's say we want to achieve the result below:
The realistic look we're after.
A first step would be to make the track the same height as the thumb and round the track ends. Up to this point, we've emulated the track with a background on the .wrap element. While it's technically possible to emulate a track with rounded ends by using layered linear and radial gradients, it's really not the best solution, especially when the wrapper still has a free pseudo-element (the ::before).
.wrap { /* same as before */ --h: 2em; --d: var(--h); &::before, &::after { border-radius: var(--r); background: #ccc; content: ''; grid-column: 1/ span 2; grid-row: 2; } &::after { background: #95a; /* non-standard WebKit version */ -webkit-mask: var(--fill); -webkit-mask-composite: xor; /* standard version, supported in Firefox */ mask: var(--fill); mask-composite: exclude; } }
See the Pen by thebabydino (@thebabydino) on CodePen.
Using ::before to emulate the track opens up the possibility of getting a slightly 3D look:
<pre rel="SCSS"><code class="language-scss">.wrap { /* same as before */ &::before, &::after { /* same as before */ box-shadow: inset 0 2px 3px rgba(#000, .3); } &::after { /* same as before */ background: linear-gradient(rgba(#fff, .3), rgba(#000, .3)) #95a; } }
I'm by no means a designer, so those values could probably be tweaked for a better looking result, but we can already see a difference:
See the Pen by thebabydino (@thebabydino) on CodePen.
This leaves us with a really ugly thumb, so let's fix that part as well!
We make use of the technique of layering multiple backgrounds with different background-clip (and background-origin) values.
@mixin thumb() { border: solid calc(.5*var(--r)) transparent; border-radius: 50%; /* make circular */ box-sizing: border-box; /* different between Chrome & Firefox */ /* box-sizing needed now that we have a non-zero border */ background: linear-gradient(rgba(#000, .15), rgba(#fff, .2)) content-box, linear-gradient(rgba(#fff, .3), rgba(#000, .3)) border-box, currentcolor; pointer-events: auto; width: var(--d); height: var(--d); }
I've described this technique in a lot of detail in an older article. Make sure you check it out if you need a refresher!
The above bit of code would do close to nothing, however, if the currentcolor value is black (#000) which it is right now. Let's fix that and also change the cursor on the thumbs to something more fitting:
input[type='range'] { /* same as before */ color: #eee; cursor: grab; &:active { cursor: grabbing; } }
The result is certainly more satisfying than before:
See the Pen by thebabydino (@thebabydino) on CodePen.
Something else that really bothers me is how close the label text is to the slider. We can fix this by introducing a grid-gap on the wrapper:
.wrap { /* same as before */ grid-gap: .625em; }
But the worst problem we still have are those absolutely positioned outputs in the top right corner. The best way to fix this is to introduce a third grid row for them and move them with the thumbs.
The position of the thumbs is computed in a similar manner to that of the sharp stops of the gradient layers we use for the fill mask.
Initially, we place the left edge of the outputs along the vertical line that's a thumb radius --r away from the left edge of the slider. In order to middle align the outputs with this vertical line, we translate them back (to the left, in the negative direction of the x-axis, so we need a minus sign) by half of their width (50%, as percentage values in translate() functions are relative to the dimensions of the element the transform is applied to).
In order to move them with the thumbs, we subtract the minimum value (--min) from the current value of the corresponding thumb (--c), divide this difference by the difference (--dif) between the maximum value (--max) and the minimum value (--min). This gives us a progress value in the [0, 1] interval. We then multiply this value with the useful width (--uw), which describes the real range of motion.
.wrap { /* same as before */ grid-template-rows: max-content var(--h) max-content; } output { background: currentcolor; border-radius: 5px; color: transparent; grid-column: 1; grid-row: 3; margin-left: var(--r); padding: 0 .375em; transform: translate(calc((var(--c) - var(--min))/var(--dif)*var(--uw) - 50%)); width: max-content; &::after { color: #fff; content: counter(c); counter-reset: c var(--c); } }
See the Pen by thebabydino (@thebabydino) on CodePen.
This looks much better at a first glance. However, a closer inspection reveals that we still have a bunch of problems.
The first one is that overflow: hidden cuts out a bit of the <output> elements when we get to the track end.
The left end of our <output> elements (with the rounded corners) gets cut out when it goes beyond the left end of the parent wrapper .wrap.
In order to fix this, we must understand what exactly overflow: hidden does. It cuts out everything outside an element's padding-box, as illustrated by the interactive demo below, where you can click the code to toggle the CSS declaration.
See the Pen by thebabydino (@thebabydino) on CodePen.
This means a quick fix for this issue is to add a big enough lateral padding on the wrapper .wrap.
padding: 0 2em;
We're styling our multi-thumb slider in isolation here, but, in reality, it probably won't be the only thing on a page, so, if spacing is limited, we can invert that lateral padding with a negative lateral margin.
If the nearby elements still have the default have position: static, the fact that we've relatively positioned the wrapper should make the outputs go on top of what they overlap, otherwise, tweaking the z-index on the .wrap should do it.
The bigger problem is that this technique we've used results in some really weird-looking <output> overlaps when were dragging the thumbs.
The <output> elements are only hidden by the page background when their corresponding thumbs are not focused and this can cause issues when dragging thumbs over each other.
Increasing the z-index when the <input> is focused on the corresponding <output> as well solves the particular problem of the <output> overlaps:
input[type='range'] { &:focus { outline: solid 0 transparent; &, & + output { color: darkorange; z-index: 2; } } }
However, it does nothing for the underlying issue and this becomes obvious when we change the background on the body, particularly if we change it to an image one, as this doesn't allow the <output> text to hide in it anymore:
See the Pen by thebabydino (@thebabydino) on CodePen.
This means we need to rethink how we hide the <output> elements in the normal state and how we reveal them in a highlight state, such as :focus. We also want to do this without bloating our CSS.
The solution is to use the technique I described about a year ago in the "DRY Switching with CSS Variables" article: use a highlight --hl custom property where the value is 0 in the normal state and 1 in a highlight state (:focus). We also compute its negation (--nothl).
* { --hl: 0; --nothl: calc(1 - var(--hl)); margin: 0; font: inherit }
As it is, this does nothing yet. The trick is to make all properties that we want to change in between the two states depend on --hl and, if necessary, its negation (code>--nothl).
$hlc: #f90; @mixin thumb() { /* same as before */ background-color: $hlc; } input[type='range'] { /* same as before */ filter: grayScale(var(--nothl)); z-index: calc(1 + var(--hl)); &:focus { outline: solid 0 transparent; &, & + output { --hl: 1; } } } output { /* same grid placement */ margin-left: var(--r); max-width: max-content; transform: translate(calc((var(--c) - var(--min))/var(--dif)*var(--uw))); &::after { /* same as before */ background: linear-gradient(rgba(#fff, .3), rgba(#000, .3)) $hlc; border-radius: 5px; display: block; padding: 0 .375em; transform: translate(-50%) scale(var(--hl)); } }
See the Pen by thebabydino (@thebabydino) on CodePen.
We're almost there! We can also add transitions on state change:
$t: .3s; input[type='range'] { /* same as before */ transition: filter $t ease-out; } output::after { /* same as before */ transition: transform $t ease-out; }
See the Pen by thebabydino (@thebabydino) on CodePen.
A final improvement would be to grayscale() the fill if none of the thumbs are focused. We can do this by using :focus-within on our wrapper:
.wrap { &::after { /* same as before */ filter: Grayscale(var(--nothl)); transition: filter $t ease-out; } &:focus-within { --hl: 1; } }
And that's it!
See the Pen by thebabydino (@thebabydino) on CodePen.
Option #2: A flat look
Let's see how we can get a flat design. For example:
The flat look we're after.
The first step is to remove the box shadows and gradients that give our previous demo a 3D look and make the track background a repeating gradient.:
See the Pen by thebabydino (@thebabydino) on CodePen.
The size change of the thumb on :focus can be controlled with a scaling transform with a factor that depends on the highlight switch variable (--hl).
@mixin thumb() { /* same as before */ transform: scale(calc(1 - .5*var(--nothl))); transition: transform $t ease-out; }
See the Pen by thebabydino (@thebabydino) on CodePen.
But what about the holes in the track around the thumbs?
The mask compositing technique is extremely useful here. This involves layering radial gradients to create discs at every thumb position and, after we're done with them, invert (i.e. compositing with a fully opaque layer) the result to turn those discs into holes.
How we XOR the gradient layer alphas (Demo).
This means altering the Pug code a bit so that we're generating the list of radial gradients that create the discs corresponding to each thumb. In turn, we'll invert those in the CSS:
//- same as before - let tpos = thumbs.map((c, i) => `calc(var(--r) + (var(--v${i}) - var(--min))/var(--dif)*var(--uw))`); - let fill = tpos.map(c => `linear-gradient(90deg, red ${c}, transparent 0)`); - let hole = tpos.map(c => `radial-gradient(circle at ${c}, red var(--r), transparent 0)`) .wrap(role='group' aria-labelledby='multi-lbl' style=`${thumbs.map((c, i) => `--v${i}: ${c.val}`).join('; ')}; --min: ${min}; --max: ${max}; --fill: ${fill.join(', ')}; --hole: ${hole.join(', ')}`) // -same wrapper content as before
This generates the following markup:
<div class='wrap' role='group' aria-labelledby='multi-lbl' style='--v0: -15; --v1: 20; --v2: -35; --v3: 45; --min: -50; --max: 50; --fill: linear-gradient(90deg, red calc(var(--r) + (var(--v0) - var(--min))/var(--dif)*var(--uw)), transparent 0), linear-gradient(90deg, red calc(var(--r) + (var(--v1) - var(--min))/var(--dif)*var(--uw)), transparent 0), linear-gradient(90deg, red calc(var(--r) + (var(--v2) - var(--min))/var(--dif)*var(--uw)), transparent 0), linear-gradient(90deg, red calc(var(--r) + (var(--v3) - var(--min))/var(--dif)*var(--uw)), transparent 0); --hole: radial-gradient(circle at calc(var(--r) + (var(--v0) - var(--min))/var(--dif)*var(--uw)), red var(--r), transparent 0), radial-gradient(circle at calc(var(--r) + (var(--v1) - var(--min))/var(--dif)*var(--uw)), red var(--r), transparent 0), radial-gradient(circle at calc(var(--r) + (var(--v2) - var(--min))/var(--dif)*var(--uw)), red var(--r), transparent 0), radial-gradient(circle at calc(var(--r) + (var(--v3) - var(--min))/var(--dif)*var(--uw)), red var(--r), transparent 0)'> <!-- same content as before --> </div>
In the CSS, we set a mask on both pseudo-elements and give a different value for each one. We also XOR the mask layers on them.
In the case of ::before, the mask is the list of radial-gradient() discs XOR-ed with a fully opaque layer (which acts as an inverter to turn the discs into circular holes). For ::after, it's the list of fill linear-gradient() layers.
.wrap { /* same as before */ &::before, &::after { content: ''; /* same as before */ --mask: linear-gradient(red, red), var(--hole); /* non-standard WebKit version */ -webkit-mask: var(--mask); -webkit-mask-composite: xor; /* standard version, supported in Firefox */ mask: var(--mask); mask-composite: exclude; } &::after { background: #95a; --mask: var(--fill); } }
See the Pen by thebabydino (@thebabydino) on CodePen.
The final step is to adjust the track, fill height, and middle align them vertically within their grid cell (along with the thumbs):
.wrap { /* same as before */ &::before, &::after { /* same as before */ align-self: center; height: 6px; } }
We now have our desired flat multi-thumb slider!
See the Pen by thebabydino (@thebabydino) on CodePen.
The post Multi-Thumb Sliders: General Case appeared first on CSS-Tricks.
Multi-Thumb Sliders: General Case published first on https://deskbysnafu.tumblr.com/
0 notes
Text
SEO Analytics for Free - Combining Google Search with the Moz API
Posted by Purple-Toolz
I’m a self-funded start-up business owner. As such, I want to get as much as I can for free before convincing our finance director to spend our hard-earned bootstrapping funds. I’m also an analyst with a background in data and computer science, so a bit of a geek by any definition.
What I try to do, with my SEO analyst hat on, is hunt down great sources of free data and wrangle it into something insightful. Why? Because there’s no value in basing client advice on conjecture. It’s far better to combine quality data with good analysis and help our clients better understand what’s important for them to focus on.
In this article, I will tell you how to get started using a few free resources and illustrate how to pull together unique analytics that provide useful insights for your blog articles if you’re a writer, your agency if you’re an SEO, or your website if you’re a client or owner doing SEO yourself.
The scenario I’m going to use is that I want analyze some SEO attributes (e.g. backlinks, Page Authority etc.) and look at their effect on Google ranking. I want to answer questions like “Do backlinks really matter in getting to Page 1 of SERPs?” and “What kind of Page Authority score do I really need to be in the top 10 results?” To do this, I will need to combine data from a number of Google searches with data on each result that has the SEO attributes in that I want to measure.
Let’s get started and work through how to combine the following tasks to achieve this, which can all be setup for free:
Querying with Google Custom Search Engine
Using the free Moz API account
Harvesting data with PHP and MySQL
Analyzing data with SQL and R
Querying with Google Custom Search Engine
We first need to query Google and get some results stored. To stay on the right side of Google’s terms of service, we’ll not be scraping Google.com directly but will instead use Google’s Custom Search feature. Google’s Custom Search is designed mainly to let website owners provide a Google like search widget on their website. However, there is also a REST based Google Search API that is free and lets you query Google and retrieve results in the popular JSON format. There are quota limits but these can be configured and extended to provide a good sample of data to work with.
When configured correctly to search the entire web, you can send queries to your Custom Search Engine, in our case using PHP, and treat them like Google responses, albeit with some caveats. The main limitations of using a Custom Search Engine are: (i) it doesn’t use some Google Web Search features such as personalized results and; (ii) it may have a subset of results from the Google index if you include more than ten sites.
Notwithstanding these limitations, there are many search options that can be passed to the Custom Search Engine to proxy what you might expect Google.com to return. In our scenario, we passed the following when making a call:
https://www.googleapis.com/customsearch/v1?key=<google_api_id>&userIp= <ip_address>&cx<custom_search_engine_id>&q=iPhone+X&cr=countryUS&start= 1</custom_search_engine_id></ip_address></google_api_id>
Where:
https://www.googleapis.com/customsearch/v1 – is the URL for the Google Custom Search API
key=<GOOGLE_API_ID> – Your Google Developer API Key
userIp=<IP_ADDRESS> – The IP address of the local machine making the call
cx=<CUSTOM_SEARCH_ENGINE_ID> – Your Google Custom Search Engine ID
q=iPhone+X – The Google query string (‘+’ replaces ‘ ‘)
cr=countryUS – Country restriction (from Goolge’s Country Collection Name list)
start=1 – The index of the first result to return – e.g. SERP page 1. Successive calls would increment this to get pages 2–5.
Google has said that the Google Custom Search engine differs from Google .com, but in my limited prod testing comparing results between the two, I was encouraged by the similarities and so continued with the analysis. That said, keep in mind that the data and results below come from Google Custom Search (using ‘whole web’ queries), not Google.com.
Using the free Moz API account
Moz provide an Application Programming Interface (API). To use it you will need to register for a Mozscape API key, which is free but limited to 2,500 rows per month and one query every ten seconds. Current paid plans give you increased quotas and start at $250/month. Having a free account and API key, you can then query the Links API and analyze the following metrics:
Moz data field
Moz API code
Description
ueid
32
The number of external equity links to the URL
uid
2048
The number of links (external, equity or nonequity or not,) to the URL
umrp**
16384
The MozRank of the URL, as a normalized 10-point score
umrr**
16384
The MozRank of the URL, as a raw score
fmrp**
32768
The MozRank of the URL's subdomain, as a normalized 10-point score
fmrr**
32768
The MozRank of the URL's subdomain, as a raw score
us
536870912
The HTTP status code recorded for this URL, if available
upa
34359738368
A normalized 100-point score representing the likelihood of a page to rank well in search engine results
pda
68719476736
A normalized 100-point score representing the likelihood of a domain to rank well in search engine results
NOTE: Since this analysis was captured, Moz documented that they have deprecated these fields. However, in testing this (15-06-2019), the fields were still present.
Moz API Codes are added together before calling the Links API with something that looks like the following:
www.apple.com%2F?Cols=103616137253&AccessID=MOZ_ACCESS_ID& Expires=1560586149&Signature=<MOZ_SECRET_KEY>
Where:
https://ift.tt/1bbWaai" class="redactor-autoparser-object">https://ift.tt/2oVcks4... – Is the URL for the Moz API
http%3A%2F%2Fwww.apple.com%2F – An encoded URL that we want to get data on
Cols=103616137253 – The sum of the Moz API codes from the table above
AccessID=MOZ_ACCESS_ID – An encoded version of the Moz Access ID (found in your API account)
Expires=1560586149 – A timeout for the query - set a few minutes into the future
Signature=<MOZ_SECRET_KEY> – An encoded version of the Moz Access ID (found in your API account)
Moz will return with something like the following JSON:
Array ( [ut] => Apple [uu] => <a href="http://www.apple.com/" class="redactor-autoparser-object">www.apple.com/</a> [ueid] => 13078035 [uid] => 14632963 [uu] => www.apple.com/ [ueid] => 13078035 [uid] => 14632963 [umrp] => 9 [umrr] => 0.8999999762 [fmrp] => 2.602215052 [fmrr] => 0.2602215111 [us] => 200 [upa] => 90 [pda] => 100 )
For a great starting point on querying Moz with PHP, Perl, Python, Ruby and Javascript, see this repository on Github. I chose to use PHP.
Harvesting data with PHP and MySQL
Now we have a Google Custom Search Engine and our Moz API, we’re almost ready to capture data. Google and Moz respond to requests via the JSON format and so can be queried by many popular programming languages. In addition to my chosen language, PHP, I wrote the results of both Google and Moz to a database and chose MySQL Community Edition for this. Other databases could be also used, e.g. Postgres, Oracle, Microsoft SQL Server etc. Doing so enables persistence of the data and ad-hoc analysis using SQL (Structured Query Language) as well as other languages (like R, which I will go over later). After creating database tables to hold the Google search results (with fields for rank, URL etc.) and a table to hold Moz data fields (ueid, upa, uda etc.), we’re ready to design our data harvesting plan.
Google provide a generous quota with the Custom Search Engine (up to 100M queries per day with the same Google developer console key) but the Moz free API is limited to 2,500. Though for Moz, paid for options provide between 120k and 40M rows per month depending on plans and range in cost from $250–$10,000/month. Therefore, as I’m just exploring the free option, I designed my code to harvest 125 Google queries over 2 pages of SERPs (10 results per page) allowing me to stay within the Moz 2,500 row quota. As for which searches to fire at Google, there are numerous resources to use from. I chose to use Mondovo as they provide numerous lists by category and up to 500 words per list which is ample for the experiment.
I also rolled in a few PHP helper classes alongside my own code for database I/O and HTTP.
In summary, the main PHP building blocks and sources used were:
Google Custom Search Engine – Ash Kiswany wrote an excellent article using Jacob Fogg’s PHP interface for Google Custom Search;
Mozscape API – As mentioned, this PHP implementation for accessing Moz on Github was a good starting point;
Website crawler and HTTP – At Purple Toolz, we have our own crawler called PurpleToolzBot which uses Curl for HTTP and this Simple HTML DOM Parser;
Database I/O – PHP has excellent support for MySQL which I wrapped into classes from these tutorials.
One factor to be aware of is the 10 second interval between Moz API calls. This is to prevent Moz being overloaded by free API users. To handle this in software, I wrote a "query throttler" which blocked access to the Moz API between successive calls within a timeframe. However, whilst working perfectly it meant that calling Moz 2,500 times in succession took just under 7 hours to complete.
Analyzing data with SQL and R
Data harvested. Now the fun begins!
It’s time to have a look at what we’ve got. This is sometimes called data wrangling. I use a free statistical programming language called R along with a development environment (editor) called R Studio. There are other languages such as Stata and more graphical data science tools like Tableau, but these cost and the finance director at Purple Toolz isn’t someone to cross!
I have been using R for a number of years because it’s open source and it has many third-party libraries, making it extremely versatile and appropriate for this kind of work.
Let’s roll up our sleeves.
I now have a couple of database tables with the results of my 125 search term queries across 2 pages of SERPS (i.e. 20 ranked URLs per search term). Two database tables hold the Google results and another table holds the Moz data results. To access these, we’ll need to do a database INNER JOIN which we can easily accomplish by using the RMySQL package with R. This is loaded by typing "install.packages('RMySQL')" into R’s console and including the line "library(RMySQL)" at the top of our R script.
We can then do the following to connect and get the data into an R data frame variable called "theResults."
library(RMySQL) # INNER JOIN the two tables theQuery <- " SELECT A.*, B.*, C.* FROM ( SELECT cseq_search_id FROM cse_query ) A -- Custom Search Query INNER JOIN ( SELECT cser_cseq_id, cser_rank, cser_url FROM cse_results ) B -- Custom Search Results ON A.cseq_search_id = B.cser_cseq_id INNER JOIN ( SELECT * FROM moz ) C -- Moz Data Fields ON B.cser_url = C.moz_url ; " # [1] Connect to the database # Replace USER_NAME with your database username # Replace PASSWORD with your database password # Replace MY_DB with your database name theConn <- dbConnect(dbDriver("MySQL"), user = "USER_NAME", password = "PASSWORD", dbname = "MY_DB") # [2] Query the database and hold the results theResults <- dbGetQuery(theConn, theQuery) # [3] Disconnect from the database dbDisconnect(theConn)
NOTE: I have two tables to hold the Google Custom Search Engine data. One holds data on the Google query (cse_query) and one holds results (cse_results).
We can now use R’s full range of statistical functions to begin wrangling.
Let’s start with some summaries to get a feel for the data. The process I go through is basically the same for each of the fields, so let’s illustrate and use Moz’s ‘UEID’ field (the number of external equity links to a URL). By typing the following into R I get the this:
> summary(theResults$moz_ueid) Min. 1st Qu. Median Mean 3rd Qu. Max. 0 1 20 14709 182 2755274 > quantile(theResults$moz_ueid, probs = c(1, 5, 10, 25, 50, 75, 80, 90, 95, 99, 100)/100) 1% 5% 10% 25% 50% 75% 80% 90% 95% 99% 100% 0.0 0.0 0.0 1.0 20.0 182.0 337.2 1715.2 7873.4 412283.4 2755274.0
Looking at this, you can see that the data is skewed (a lot) by the relationship of the median to the mean, which is being pulled by values in the upper quartile range (values beyond 75% of the observations). We can however, plot this as a box and whisker plot in R where each X value is the distribution of UEIDs by rank from Google Custom Search position 1-20.
Note we are using a log scale on the y-axis so that we can display the full range of values as they vary a lot!
A box and whisker plot in R of Moz’s UEID by Google rank (note: log scale)
Box and whisker plots are great as they show a lot of information in them (see the geom_boxplot function in R). The purple boxed area represents the Inter-Quartile Range (IQR) which are the values between 25% and 75% of observations. The horizontal line in each ‘box’ represents the median value (the one in the middle when ordered), whilst the lines extending from the box (called the ‘whiskers’) represent 1.5x IQR. Dots outside the whiskers are called ‘outliers’ and show where the extents of each rank’s set of observations are. Despite the log scale, we can see a noticeable pull-up from rank #10 to rank #1 in median values, indicating that the number of equity links might be a Google ranking factor. Let’s explore this further with density plots.
Density plots are a lot like distributions (histograms) but show smooth lines rather than bars for the data. Much like a histogram, a density plot’s peak shows where the data values are concentrated and can help when comparing two distributions. In the density plot below, I have split the data into two categories: (i) results that appeared on Page 1 of SERPs ranked 1-10 are in pink and; (ii) results that appeared on SERP Page 2 are in blue. I have also plotted the medians of both distributions to help illustrate the difference in results between Page 1 and Page 2.
The inference from these two density plots is that Page 1 SERP results had more external equity backlinks (UEIDs) on than Page 2 results. You can also see the median values for these two categories below which clearly shows how the value for Page 1 (38) is far greater than Page 2 (11). So we now have some numbers to base our SEO strategy for backlinks on.
# Create a factor in R according to which SERP page a result (cser_rank) is on > theResults$rankBin <- paste("Page", ceiling(theResults$cser_rank / 10)) > theResults$rankBin <- factor(theResults$rankBin) # Now report the medians by SERP page by calling ‘tapply’ > tapply(theResults$moz_ueid, theResults$rankBin, median) Page 1 Page 2 38 11
From this, we can deduce that equity backlinks (UEID) matter and if I were advising a client based on this data, I would say they should be looking to get over 38 equity-based backlinks to help them get to Page 1 of SERPs. Of course, this is a limited sample and more research, a bigger sample and other ranking factors would need to be considered, but you get the idea.
Now let’s investigate another metric that has less of a range on it than UEID and look at Moz’s UPA measure, which is the likelihood that a page will rank well in search engine results.
> summary(theResults$moz_upa) Min. 1st Qu. Median Mean 3rd Qu. Max. 1.00 33.00 41.00 41.22 50.00 81.00 > quantile(theResults$moz_upa, probs = c(1, 5, 10, 25, 50, 75, 80, 90, 95, 99, 100)/100) 1% 5% 10% 25% 50% 75% 80% 90% 95% 99% 100% 12 20 25 33 41 50 53 58 62 75 81
UPA is a number given to a URL and ranges between 0–100. The data is better behaved than the previous UEID unbounded variable having its mean and median close together making for a more ‘normal’ distribution as we can see below by plotting a histogram in R.
A histogram of Moz’s UPA score
We’ll do the same Page 1 : Page 2 split and density plot that we did before and look at the UPA score distributions when we divide the UPA data into two groups.
# Report the medians by SERP page by calling ‘tapply’ > tapply(theResults$moz_upa, theResults$rankBin, median) Page 1 Page 2 43 39
In summary, two very different distributions from two Moz API variables. But both showed differences in their scores between SERP pages and provide you with tangible values (medians) to work with and ultimately advise clients on or apply to your own SEO.
Of course, this is just a small sample and shouldn’t be taken literally. But with free resources from both Google and Moz, you can now see how you can begin to develop analytical capabilities of your own to base your assumptions on rather than accepting the norm. SEO ranking factors change all the time and having your own analytical tools to conduct your own tests and experiments on will help give you credibility and perhaps even a unique insight on something hitherto unknown.
Google provide you with a healthy free quota to obtain search results from. If you need more than the 2,500 rows/month Moz provide for free there are numerous paid-for plans you can purchase. MySQL is a free download and R is also a free package for statistical analysis (and much more).
Go explore!
Sign up for The Moz Top 10, a semimonthly mailer updating you on the top ten hottest pieces of SEO news, tips, and rad links uncovered by the Moz team. Think of it as your exclusive digest of stuff you don't have time to hunt down but want to read!
0 notes
Text
SEO Analytics for Free - Combining Google Search with the Moz API
Posted by Purple-Toolz
I’m a self-funded start-up business owner. As such, I want to get as much as I can for free before convincing our finance director to spend our hard-earned bootstrapping funds. I’m also an analyst with a background in data and computer science, so a bit of a geek by any definition.
What I try to do, with my SEO analyst hat on, is hunt down great sources of free data and wrangle it into something insightful. Why? Because there’s no value in basing client advice on conjecture. It’s far better to combine quality data with good analysis and help our clients better understand what’s important for them to focus on.
In this article, I will tell you how to get started using a few free resources and illustrate how to pull together unique analytics that provide useful insights for your blog articles if you’re a writer, your agency if you’re an SEO, or your website if you’re a client or owner doing SEO yourself.
The scenario I’m going to use is that I want analyze some SEO attributes (e.g. backlinks, Page Authority etc.) and look at their effect on Google ranking. I want to answer questions like “Do backlinks really matter in getting to Page 1 of SERPs?” and “What kind of Page Authority score do I really need to be in the top 10 results?” To do this, I will need to combine data from a number of Google searches with data on each result that has the SEO attributes in that I want to measure.
Let’s get started and work through how to combine the following tasks to achieve this, which can all be setup for free:
Querying with Google Custom Search Engine
Using the free Moz API account
Harvesting data with PHP and MySQL
Analyzing data with SQL and R
Querying with Google Custom Search Engine
We first need to query Google and get some results stored. To stay on the right side of Google’s terms of service, we’ll not be scraping Google.com directly but will instead use Google’s Custom Search feature. Google’s Custom Search is designed mainly to let website owners provide a Google like search widget on their website. However, there is also a REST based Google Search API that is free and lets you query Google and retrieve results in the popular JSON format. There are quota limits but these can be configured and extended to provide a good sample of data to work with.
When configured correctly to search the entire web, you can send queries to your Custom Search Engine, in our case using PHP, and treat them like Google responses, albeit with some caveats. The main limitations of using a Custom Search Engine are: (i) it doesn’t use some Google Web Search features such as personalized results and; (ii) it may have a subset of results from the Google index if you include more than ten sites.
Notwithstanding these limitations, there are many search options that can be passed to the Custom Search Engine to proxy what you might expect Google.com to return. In our scenario, we passed the following when making a call:
https://www.googleapis.com/customsearch/v1?key=<google_api_id>&userIp= <ip_address>&cx<custom_search_engine_id>&q=iPhone+X&cr=countryUS&start= 1</custom_search_engine_id></ip_address></google_api_id>
Where:
https://www.googleapis.com/customsearch/v1 – is the URL for the Google Custom Search API
key=<GOOGLE_API_ID> – Your Google Developer API Key
userIp=<IP_ADDRESS> – The IP address of the local machine making the call
cx=<CUSTOM_SEARCH_ENGINE_ID> – Your Google Custom Search Engine ID
q=iPhone+X – The Google query string (‘+’ replaces ‘ ‘)
cr=countryUS – Country restriction (from Goolge’s Country Collection Name list)
start=1 – The index of the first result to return – e.g. SERP page 1. Successive calls would increment this to get pages 2–5.
Google has said that the Google Custom Search engine differs from Google .com, but in my limited prod testing comparing results between the two, I was encouraged by the similarities and so continued with the analysis. That said, keep in mind that the data and results below come from Google Custom Search (using ‘whole web’ queries), not Google.com.
Using the free Moz API account
Moz provide an Application Programming Interface (API). To use it you will need to register for a Mozscape API key, which is free but limited to 2,500 rows per month and one query every ten seconds. Current paid plans give you increased quotas and start at $250/month. Having a free account and API key, you can then query the Links API and analyze the following metrics:
Moz data field
Moz API code
Description
ueid
32
The number of external equity links to the URL
uid
2048
The number of links (external, equity or nonequity or not,) to the URL
umrp**
16384
The MozRank of the URL, as a normalized 10-point score
umrr**
16384
The MozRank of the URL, as a raw score
fmrp**
32768
The MozRank of the URL's subdomain, as a normalized 10-point score
fmrr**
32768
The MozRank of the URL's subdomain, as a raw score
us
536870912
The HTTP status code recorded for this URL, if available
upa
34359738368
A normalized 100-point score representing the likelihood of a page to rank well in search engine results
pda
68719476736
A normalized 100-point score representing the likelihood of a domain to rank well in search engine results
NOTE: Since this analysis was captured, Moz documented that they have deprecated these fields. However, in testing this (15-06-2019), the fields were still present.
Moz API Codes are added together before calling the Links API with something that looks like the following:
www.apple.com%2F?Cols=103616137253&AccessID=MOZ_ACCESS_ID& Expires=1560586149&Signature=<MOZ_SECRET_KEY>
Where:
https://ift.tt/1bbWaai" class="redactor-autoparser-object">https://ift.tt/2oVcks4... – Is the URL for the Moz API
http%3A%2F%2Fwww.apple.com%2F – An encoded URL that we want to get data on
Cols=103616137253 – The sum of the Moz API codes from the table above
AccessID=MOZ_ACCESS_ID – An encoded version of the Moz Access ID (found in your API account)
Expires=1560586149 – A timeout for the query - set a few minutes into the future
Signature=<MOZ_SECRET_KEY> – An encoded version of the Moz Access ID (found in your API account)
Moz will return with something like the following JSON:
Array ( [ut] => Apple [uu] => <a href="http://www.apple.com/" class="redactor-autoparser-object">www.apple.com/</a> [ueid] => 13078035 [uid] => 14632963 [uu] => www.apple.com/ [ueid] => 13078035 [uid] => 14632963 [umrp] => 9 [umrr] => 0.8999999762 [fmrp] => 2.602215052 [fmrr] => 0.2602215111 [us] => 200 [upa] => 90 [pda] => 100 )
For a great starting point on querying Moz with PHP, Perl, Python, Ruby and Javascript, see this repository on Github. I chose to use PHP.
Harvesting data with PHP and MySQL
Now we have a Google Custom Search Engine and our Moz API, we’re almost ready to capture data. Google and Moz respond to requests via the JSON format and so can be queried by many popular programming languages. In addition to my chosen language, PHP, I wrote the results of both Google and Moz to a database and chose MySQL Community Edition for this. Other databases could be also used, e.g. Postgres, Oracle, Microsoft SQL Server etc. Doing so enables persistence of the data and ad-hoc analysis using SQL (Structured Query Language) as well as other languages (like R, which I will go over later). After creating database tables to hold the Google search results (with fields for rank, URL etc.) and a table to hold Moz data fields (ueid, upa, uda etc.), we’re ready to design our data harvesting plan.
Google provide a generous quota with the Custom Search Engine (up to 100M queries per day with the same Google developer console key) but the Moz free API is limited to 2,500. Though for Moz, paid for options provide between 120k and 40M rows per month depending on plans and range in cost from $250–$10,000/month. Therefore, as I’m just exploring the free option, I designed my code to harvest 125 Google queries over 2 pages of SERPs (10 results per page) allowing me to stay within the Moz 2,500 row quota. As for which searches to fire at Google, there are numerous resources to use from. I chose to use Mondovo as they provide numerous lists by category and up to 500 words per list which is ample for the experiment.
I also rolled in a few PHP helper classes alongside my own code for database I/O and HTTP.
In summary, the main PHP building blocks and sources used were:
Google Custom Search Engine – Ash Kiswany wrote an excellent article using Jacob Fogg’s PHP interface for Google Custom Search;
Mozscape API – As mentioned, this PHP implementation for accessing Moz on Github was a good starting point;
Website crawler and HTTP – At Purple Toolz, we have our own crawler called PurpleToolzBot which uses Curl for HTTP and this Simple HTML DOM Parser;
Database I/O – PHP has excellent support for MySQL which I wrapped into classes from these tutorials.
One factor to be aware of is the 10 second interval between Moz API calls. This is to prevent Moz being overloaded by free API users. To handle this in software, I wrote a "query throttler" which blocked access to the Moz API between successive calls within a timeframe. However, whilst working perfectly it meant that calling Moz 2,500 times in succession took just under 7 hours to complete.
Analyzing data with SQL and R
Data harvested. Now the fun begins!
It’s time to have a look at what we’ve got. This is sometimes called data wrangling. I use a free statistical programming language called R along with a development environment (editor) called R Studio. There are other languages such as Stata and more graphical data science tools like Tableau, but these cost and the finance director at Purple Toolz isn’t someone to cross!
I have been using R for a number of years because it’s open source and it has many third-party libraries, making it extremely versatile and appropriate for this kind of work.
Let’s roll up our sleeves.
I now have a couple of database tables with the results of my 125 search term queries across 2 pages of SERPS (i.e. 20 ranked URLs per search term). Two database tables hold the Google results and another table holds the Moz data results. To access these, we’ll need to do a database INNER JOIN which we can easily accomplish by using the RMySQL package with R. This is loaded by typing "install.packages('RMySQL')" into R’s console and including the line "library(RMySQL)" at the top of our R script.
We can then do the following to connect and get the data into an R data frame variable called "theResults."
library(RMySQL) # INNER JOIN the two tables theQuery <- " SELECT A.*, B.*, C.* FROM ( SELECT cseq_search_id FROM cse_query ) A -- Custom Search Query INNER JOIN ( SELECT cser_cseq_id, cser_rank, cser_url FROM cse_results ) B -- Custom Search Results ON A.cseq_search_id = B.cser_cseq_id INNER JOIN ( SELECT * FROM moz ) C -- Moz Data Fields ON B.cser_url = C.moz_url ; " # [1] Connect to the database # Replace USER_NAME with your database username # Replace PASSWORD with your database password # Replace MY_DB with your database name theConn <- dbConnect(dbDriver("MySQL"), user = "USER_NAME", password = "PASSWORD", dbname = "MY_DB") # [2] Query the database and hold the results theResults <- dbGetQuery(theConn, theQuery) # [3] Disconnect from the database dbDisconnect(theConn)
NOTE: I have two tables to hold the Google Custom Search Engine data. One holds data on the Google query (cse_query) and one holds results (cse_results).
We can now use R’s full range of statistical functions to begin wrangling.
Let’s start with some summaries to get a feel for the data. The process I go through is basically the same for each of the fields, so let’s illustrate and use Moz’s ‘UEID’ field (the number of external equity links to a URL). By typing the following into R I get the this:
> summary(theResults$moz_ueid) Min. 1st Qu. Median Mean 3rd Qu. Max. 0 1 20 14709 182 2755274 > quantile(theResults$moz_ueid, probs = c(1, 5, 10, 25, 50, 75, 80, 90, 95, 99, 100)/100) 1% 5% 10% 25% 50% 75% 80% 90% 95% 99% 100% 0.0 0.0 0.0 1.0 20.0 182.0 337.2 1715.2 7873.4 412283.4 2755274.0
Looking at this, you can see that the data is skewed (a lot) by the relationship of the median to the mean, which is being pulled by values in the upper quartile range (values beyond 75% of the observations). We can however, plot this as a box and whisker plot in R where each X value is the distribution of UEIDs by rank from Google Custom Search position 1-20.
Note we are using a log scale on the y-axis so that we can display the full range of values as they vary a lot!
A box and whisker plot in R of Moz’s UEID by Google rank (note: log scale)
Box and whisker plots are great as they show a lot of information in them (see the geom_boxplot function in R). The purple boxed area represents the Inter-Quartile Range (IQR) which are the values between 25% and 75% of observations. The horizontal line in each ‘box’ represents the median value (the one in the middle when ordered), whilst the lines extending from the box (called the ‘whiskers’) represent 1.5x IQR. Dots outside the whiskers are called ‘outliers’ and show where the extents of each rank’s set of observations are. Despite the log scale, we can see a noticeable pull-up from rank #10 to rank #1 in median values, indicating that the number of equity links might be a Google ranking factor. Let’s explore this further with density plots.
Density plots are a lot like distributions (histograms) but show smooth lines rather than bars for the data. Much like a histogram, a density plot’s peak shows where the data values are concentrated and can help when comparing two distributions. In the density plot below, I have split the data into two categories: (i) results that appeared on Page 1 of SERPs ranked 1-10 are in pink and; (ii) results that appeared on SERP Page 2 are in blue. I have also plotted the medians of both distributions to help illustrate the difference in results between Page 1 and Page 2.
The inference from these two density plots is that Page 1 SERP results had more external equity backlinks (UEIDs) on than Page 2 results. You can also see the median values for these two categories below which clearly shows how the value for Page 1 (38) is far greater than Page 2 (11). So we now have some numbers to base our SEO strategy for backlinks on.
# Create a factor in R according to which SERP page a result (cser_rank) is on > theResults$rankBin <- paste("Page", ceiling(theResults$cser_rank / 10)) > theResults$rankBin <- factor(theResults$rankBin) # Now report the medians by SERP page by calling ‘tapply’ > tapply(theResults$moz_ueid, theResults$rankBin, median) Page 1 Page 2 38 11
From this, we can deduce that equity backlinks (UEID) matter and if I were advising a client based on this data, I would say they should be looking to get over 38 equity-based backlinks to help them get to Page 1 of SERPs. Of course, this is a limited sample and more research, a bigger sample and other ranking factors would need to be considered, but you get the idea.
Now let’s investigate another metric that has less of a range on it than UEID and look at Moz’s UPA measure, which is the likelihood that a page will rank well in search engine results.
> summary(theResults$moz_upa) Min. 1st Qu. Median Mean 3rd Qu. Max. 1.00 33.00 41.00 41.22 50.00 81.00 > quantile(theResults$moz_upa, probs = c(1, 5, 10, 25, 50, 75, 80, 90, 95, 99, 100)/100) 1% 5% 10% 25% 50% 75% 80% 90% 95% 99% 100% 12 20 25 33 41 50 53 58 62 75 81
UPA is a number given to a URL and ranges between 0–100. The data is better behaved than the previous UEID unbounded variable having its mean and median close together making for a more ‘normal’ distribution as we can see below by plotting a histogram in R.
A histogram of Moz’s UPA score
We’ll do the same Page 1 : Page 2 split and density plot that we did before and look at the UPA score distributions when we divide the UPA data into two groups.
# Report the medians by SERP page by calling ‘tapply’ > tapply(theResults$moz_upa, theResults$rankBin, median) Page 1 Page 2 43 39
In summary, two very different distributions from two Moz API variables. But both showed differences in their scores between SERP pages and provide you with tangible values (medians) to work with and ultimately advise clients on or apply to your own SEO.
Of course, this is just a small sample and shouldn’t be taken literally. But with free resources from both Google and Moz, you can now see how you can begin to develop analytical capabilities of your own to base your assumptions on rather than accepting the norm. SEO ranking factors change all the time and having your own analytical tools to conduct your own tests and experiments on will help give you credibility and perhaps even a unique insight on something hitherto unknown.
Google provide you with a healthy free quota to obtain search results from. If you need more than the 2,500 rows/month Moz provide for free there are numerous paid-for plans you can purchase. MySQL is a free download and R is also a free package for statistical analysis (and much more).
Go explore!
Sign up for The Moz Top 10, a semimonthly mailer updating you on the top ten hottest pieces of SEO news, tips, and rad links uncovered by the Moz team. Think of it as your exclusive digest of stuff you don't have time to hunt down but want to read!
https://ift.tt/2Wc2fmI
0 notes
Text
SEO Analytics for Free - Combining Google Search with the Moz API
Posted by Purple-Toolz
I’m a self-funded start-up business owner. As such, I want to get as much as I can for free before convincing our finance director to spend our hard-earned bootstrapping funds. I’m also an analyst with a background in data and computer science, so a bit of a geek by any definition.
What I try to do, with my SEO analyst hat on, is hunt down great sources of free data and wrangle it into something insightful. Why? Because there’s no value in basing client advice on conjecture. It’s far better to combine quality data with good analysis and help our clients better understand what’s important for them to focus on.
In this article, I will tell you how to get started using a few free resources and illustrate how to pull together unique analytics that provide useful insights for your blog articles if you’re a writer, your agency if you’re an SEO, or your website if you’re a client or owner doing SEO yourself.
The scenario I’m going to use is that I want analyze some SEO attributes (e.g. backlinks, Page Authority etc.) and look at their effect on Google ranking. I want to answer questions like “Do backlinks really matter in getting to Page 1 of SERPs?” and “What kind of Page Authority score do I really need to be in the top 10 results?” To do this, I will need to combine data from a number of Google searches with data on each result that has the SEO attributes in that I want to measure.
Let’s get started and work through how to combine the following tasks to achieve this, which can all be setup for free:
Querying with Google Custom Search Engine
Using the free Moz API account
Harvesting data with PHP and MySQL
Analyzing data with SQL and R
Querying with Google Custom Search Engine
We first need to query Google and get some results stored. To stay on the right side of Google’s terms of service, we’ll not be scraping Google.com directly but will instead use Google’s Custom Search feature. Google’s Custom Search is designed mainly to let website owners provide a Google like search widget on their website. However, there is also a REST based Google Search API that is free and lets you query Google and retrieve results in the popular JSON format. There are quota limits but these can be configured and extended to provide a good sample of data to work with.
When configured correctly to search the entire web, you can send queries to your Custom Search Engine, in our case using PHP, and treat them like Google responses, albeit with some caveats. The main limitations of using a Custom Search Engine are: (i) it doesn’t use some Google Web Search features such as personalized results and; (ii) it may have a subset of results from the Google index if you include more than ten sites.
Notwithstanding these limitations, there are many search options that can be passed to the Custom Search Engine to proxy what you might expect Google.com to return. In our scenario, we passed the following when making a call:
https://www.googleapis.com/customsearch/v1?key=<google_api_id>&userIp= <ip_address>&cx<custom_search_engine_id>&q=iPhone+X&cr=countryUS&start= 1</custom_search_engine_id></ip_address></google_api_id>
Where:
https://www.googleapis.com/customsearch/v1 – is the URL for the Google Custom Search API
key=<GOOGLE_API_ID> – Your Google Developer API Key
userIp=<IP_ADDRESS> – The IP address of the local machine making the call
cx=<CUSTOM_SEARCH_ENGINE_ID> – Your Google Custom Search Engine ID
q=iPhone+X – The Google query string (‘+’ replaces ‘ ‘)
cr=countryUS – Country restriction (from Goolge’s Country Collection Name list)
start=1 – The index of the first result to return – e.g. SERP page 1. Successive calls would increment this to get pages 2–5.
Google has said that the Google Custom Search engine differs from Google .com, but in my limited prod testing comparing results between the two, I was encouraged by the similarities and so continued with the analysis. That said, keep in mind that the data and results below come from Google Custom Search (using ‘whole web’ queries), not Google.com.
Using the free Moz API account
Moz provide an Application Programming Interface (API). To use it you will need to register for a Mozscape API key, which is free but limited to 2,500 rows per month and one query every ten seconds. Current paid plans give you increased quotas and start at $250/month. Having a free account and API key, you can then query the Links API and analyze the following metrics:
Moz data field
Moz API code
Description
ueid
32
The number of external equity links to the URL
uid
2048
The number of links (external, equity or nonequity or not,) to the URL
umrp**
16384
The MozRank of the URL, as a normalized 10-point score
umrr**
16384
The MozRank of the URL, as a raw score
fmrp**
32768
The MozRank of the URL's subdomain, as a normalized 10-point score
fmrr**
32768
The MozRank of the URL's subdomain, as a raw score
us
536870912
The HTTP status code recorded for this URL, if available
upa
34359738368
A normalized 100-point score representing the likelihood of a page to rank well in search engine results
pda
68719476736
A normalized 100-point score representing the likelihood of a domain to rank well in search engine results
NOTE: Since this analysis was captured, Moz documented that they have deprecated these fields. However, in testing this (15-06-2019), the fields were still present.
Moz API Codes are added together before calling the Links API with something that looks like the following:
www.apple.com%2F?Cols=103616137253&AccessID=MOZ_ACCESS_ID& Expires=1560586149&Signature=<MOZ_SECRET_KEY>
Where:
http://bit.ly/32DCMF5" class="redactor-autoparser-object">http://bit.ly/2W2NToq... – Is the URL for the Moz API
http%3A%2F%2Fwww.apple.com%2F – An encoded URL that we want to get data on
Cols=103616137253 – The sum of the Moz API codes from the table above
AccessID=MOZ_ACCESS_ID – An encoded version of the Moz Access ID (found in your API account)
Expires=1560586149 – A timeout for the query - set a few minutes into the future
Signature=<MOZ_SECRET_KEY> – An encoded version of the Moz Access ID (found in your API account)
Moz will return with something like the following JSON:
Array ( [ut] => Apple [uu] => <a href="http://www.apple.com/" class="redactor-autoparser-object">www.apple.com/</a> [ueid] => 13078035 [uid] => 14632963 [uu] => www.apple.com/ [ueid] => 13078035 [uid] => 14632963 [umrp] => 9 [umrr] => 0.8999999762 [fmrp] => 2.602215052 [fmrr] => 0.2602215111 [us] => 200 [upa] => 90 [pda] => 100 )
For a great starting point on querying Moz with PHP, Perl, Python, Ruby and Javascript, see this repository on Github. I chose to use PHP.
Harvesting data with PHP and MySQL
Now we have a Google Custom Search Engine and our Moz API, we’re almost ready to capture data. Google and Moz respond to requests via the JSON format and so can be queried by many popular programming languages. In addition to my chosen language, PHP, I wrote the results of both Google and Moz to a database and chose MySQL Community Edition for this. Other databases could be also used, e.g. Postgres, Oracle, Microsoft SQL Server etc. Doing so enables persistence of the data and ad-hoc analysis using SQL (Structured Query Language) as well as other languages (like R, which I will go over later). After creating database tables to hold the Google search results (with fields for rank, URL etc.) and a table to hold Moz data fields (ueid, upa, uda etc.), we’re ready to design our data harvesting plan.
Google provide a generous quota with the Custom Search Engine (up to 100M queries per day with the same Google developer console key) but the Moz free API is limited to 2,500. Though for Moz, paid for options provide between 120k and 40M rows per month depending on plans and range in cost from $250–$10,000/month. Therefore, as I’m just exploring the free option, I designed my code to harvest 125 Google queries over 2 pages of SERPs (10 results per page) allowing me to stay within the Moz 2,500 row quota. As for which searches to fire at Google, there are numerous resources to use from. I chose to use Mondovo as they provide numerous lists by category and up to 500 words per list which is ample for the experiment.
I also rolled in a few PHP helper classes alongside my own code for database I/O and HTTP.
In summary, the main PHP building blocks and sources used were:
Google Custom Search Engine – Ash Kiswany wrote an excellent article using Jacob Fogg’s PHP interface for Google Custom Search;
Mozscape API – As mentioned, this PHP implementation for accessing Moz on Github was a good starting point;
Website crawler and HTTP – At Purple Toolz, we have our own crawler called PurpleToolzBot which uses Curl for HTTP and this Simple HTML DOM Parser;
Database I/O – PHP has excellent support for MySQL which I wrapped into classes from these tutorials.
One factor to be aware of is the 10 second interval between Moz API calls. This is to prevent Moz being overloaded by free API users. To handle this in software, I wrote a "query throttler" which blocked access to the Moz API between successive calls within a timeframe. However, whilst working perfectly it meant that calling Moz 2,500 times in succession took just under 7 hours to complete.
Analyzing data with SQL and R
Data harvested. Now the fun begins!
It’s time to have a look at what we’ve got. This is sometimes called data wrangling. I use a free statistical programming language called R along with a development environment (editor) called R Studio. There are other languages such as Stata and more graphical data science tools like Tableau, but these cost and the finance director at Purple Toolz isn’t someone to cross!
I have been using R for a number of years because it’s open source and it has many third-party libraries, making it extremely versatile and appropriate for this kind of work.
Let’s roll up our sleeves.
I now have a couple of database tables with the results of my 125 search term queries across 2 pages of SERPS (i.e. 20 ranked URLs per search term). Two database tables hold the Google results and another table holds the Moz data results. To access these, we’ll need to do a database INNER JOIN which we can easily accomplish by using the RMySQL package with R. This is loaded by typing "install.packages('RMySQL')" into R’s console and including the line "library(RMySQL)" at the top of our R script.
We can then do the following to connect and get the data into an R data frame variable called "theResults."
library(RMySQL) # INNER JOIN the two tables theQuery <- " SELECT A.*, B.*, C.* FROM ( SELECT cseq_search_id FROM cse_query ) A -- Custom Search Query INNER JOIN ( SELECT cser_cseq_id, cser_rank, cser_url FROM cse_results ) B -- Custom Search Results ON A.cseq_search_id = B.cser_cseq_id INNER JOIN ( SELECT * FROM moz ) C -- Moz Data Fields ON B.cser_url = C.moz_url ; " # [1] Connect to the database # Replace USER_NAME with your database username # Replace PASSWORD with your database password # Replace MY_DB with your database name theConn <- dbConnect(dbDriver("MySQL"), user = "USER_NAME", password = "PASSWORD", dbname = "MY_DB") # [2] Query the database and hold the results theResults <- dbGetQuery(theConn, theQuery) # [3] Disconnect from the database dbDisconnect(theConn)
NOTE: I have two tables to hold the Google Custom Search Engine data. One holds data on the Google query (cse_query) and one holds results (cse_results).
We can now use R’s full range of statistical functions to begin wrangling.
Let’s start with some summaries to get a feel for the data. The process I go through is basically the same for each of the fields, so let’s illustrate and use Moz’s ‘UEID’ field (the number of external equity links to a URL). By typing the following into R I get the this:
> summary(theResults$moz_ueid) Min. 1st Qu. Median Mean 3rd Qu. Max. 0 1 20 14709 182 2755274 > quantile(theResults$moz_ueid, probs = c(1, 5, 10, 25, 50, 75, 80, 90, 95, 99, 100)/100) 1% 5% 10% 25% 50% 75% 80% 90% 95% 99% 100% 0.0 0.0 0.0 1.0 20.0 182.0 337.2 1715.2 7873.4 412283.4 2755274.0
Looking at this, you can see that the data is skewed (a lot) by the relationship of the median to the mean, which is being pulled by values in the upper quartile range (values beyond 75% of the observations). We can however, plot this as a box and whisker plot in R where each X value is the distribution of UEIDs by rank from Google Custom Search position 1-20.
Note we are using a log scale on the y-axis so that we can display the full range of values as they vary a lot!
A box and whisker plot in R of Moz’s UEID by Google rank (note: log scale)
Box and whisker plots are great as they show a lot of information in them (see the geom_boxplot function in R). The purple boxed area represents the Inter-Quartile Range (IQR) which are the values between 25% and 75% of observations. The horizontal line in each ‘box’ represents the median value (the one in the middle when ordered), whilst the lines extending from the box (called the ‘whiskers’) represent 1.5x IQR. Dots outside the whiskers are called ‘outliers’ and show where the extents of each rank’s set of observations are. Despite the log scale, we can see a noticeable pull-up from rank #10 to rank #1 in median values, indicating that the number of equity links might be a Google ranking factor. Let’s explore this further with density plots.
Density plots are a lot like distributions (histograms) but show smooth lines rather than bars for the data. Much like a histogram, a density plot’s peak shows where the data values are concentrated and can help when comparing two distributions. In the density plot below, I have split the data into two categories: (i) results that appeared on Page 1 of SERPs ranked 1-10 are in pink and; (ii) results that appeared on SERP Page 2 are in blue. I have also plotted the medians of both distributions to help illustrate the difference in results between Page 1 and Page 2.
The inference from these two density plots is that Page 1 SERP results had more external equity backlinks (UEIDs) on than Page 2 results. You can also see the median values for these two categories below which clearly shows how the value for Page 1 (38) is far greater than Page 2 (11). So we now have some numbers to base our SEO strategy for backlinks on.
# Create a factor in R according to which SERP page a result (cser_rank) is on > theResults$rankBin <- paste("Page", ceiling(theResults$cser_rank / 10)) > theResults$rankBin <- factor(theResults$rankBin) # Now report the medians by SERP page by calling ‘tapply’ > tapply(theResults$moz_ueid, theResults$rankBin, median) Page 1 Page 2 38 11
From this, we can deduce that equity backlinks (UEID) matter and if I were advising a client based on this data, I would say they should be looking to get over 38 equity-based backlinks to help them get to Page 1 of SERPs. Of course, this is a limited sample and more research, a bigger sample and other ranking factors would need to be considered, but you get the idea.
Now let’s investigate another metric that has less of a range on it than UEID and look at Moz’s UPA measure, which is the likelihood that a page will rank well in search engine results.
> summary(theResults$moz_upa) Min. 1st Qu. Median Mean 3rd Qu. Max. 1.00 33.00 41.00 41.22 50.00 81.00 > quantile(theResults$moz_upa, probs = c(1, 5, 10, 25, 50, 75, 80, 90, 95, 99, 100)/100) 1% 5% 10% 25% 50% 75% 80% 90% 95% 99% 100% 12 20 25 33 41 50 53 58 62 75 81
UPA is a number given to a URL and ranges between 0–100. The data is better behaved than the previous UEID unbounded variable having its mean and median close together making for a more ‘normal’ distribution as we can see below by plotting a histogram in R.
A histogram of Moz’s UPA score
We’ll do the same Page 1 : Page 2 split and density plot that we did before and look at the UPA score distributions when we divide the UPA data into two groups.
# Report the medians by SERP page by calling ‘tapply’ > tapply(theResults$moz_upa, theResults$rankBin, median) Page 1 Page 2 43 39
In summary, two very different distributions from two Moz API variables. But both showed differences in their scores between SERP pages and provide you with tangible values (medians) to work with and ultimately advise clients on or apply to your own SEO.
Of course, this is just a small sample and shouldn’t be taken literally. But with free resources from both Google and Moz, you can now see how you can begin to develop analytical capabilities of your own to base your assumptions on rather than accepting the norm. SEO ranking factors change all the time and having your own analytical tools to conduct your own tests and experiments on will help give you credibility and perhaps even a unique insight on something hitherto unknown.
Google provide you with a healthy free quota to obtain search results from. If you need more than the 2,500 rows/month Moz provide for free there are numerous paid-for plans you can purchase. MySQL is a free download and R is also a free package for statistical analysis (and much more).
Go explore!
Sign up for The Moz Top 10, a semimonthly mailer updating you on the top ten hottest pieces of SEO news, tips, and rad links uncovered by the Moz team. Think of it as your exclusive digest of stuff you don't have time to hunt down but want to read!
0 notes
Text
SEO Analytics for Free - Combining Google Search with the Moz API
Posted by Purple-Toolz
I’m a self-funded start-up business owner. As such, I want to get as much as I can for free before convincing our finance director to spend our hard-earned bootstrapping funds. I’m also an analyst with a background in data and computer science, so a bit of a geek by any definition.
What I try to do, with my SEO analyst hat on, is hunt down great sources of free data and wrangle it into something insightful. Why? Because there’s no value in basing client advice on conjecture. It’s far better to combine quality data with good analysis and help our clients better understand what’s important for them to focus on.
In this article, I will tell you how to get started using a few free resources and illustrate how to pull together unique analytics that provide useful insights for your blog articles if you’re a writer, your agency if you’re an SEO, or your website if you’re a client or owner doing SEO yourself.
The scenario I’m going to use is that I want analyze some SEO attributes (e.g. backlinks, Page Authority etc.) and look at their effect on Google ranking. I want to answer questions like “Do backlinks really matter in getting to Page 1 of SERPs?” and “What kind of Page Authority score do I really need to be in the top 10 results?” To do this, I will need to combine data from a number of Google searches with data on each result that has the SEO attributes in that I want to measure.
Let’s get started and work through how to combine the following tasks to achieve this, which can all be setup for free:
Querying with Google Custom Search Engine
Using the free Moz API account
Harvesting data with PHP and MySQL
Analyzing data with SQL and R
Querying with Google Custom Search Engine
We first need to query Google and get some results stored. To stay on the right side of Google’s terms of service, we’ll not be scraping Google.com directly but will instead use Google’s Custom Search feature. Google’s Custom Search is designed mainly to let website owners provide a Google like search widget on their website. However, there is also a REST based Google Search API that is free and lets you query Google and retrieve results in the popular JSON format. There are quota limits but these can be configured and extended to provide a good sample of data to work with.
When configured correctly to search the entire web, you can send queries to your Custom Search Engine, in our case using PHP, and treat them like Google responses, albeit with some caveats. The main limitations of using a Custom Search Engine are: (i) it doesn’t use some Google Web Search features such as personalized results and; (ii) it may have a subset of results from the Google index if you include more than ten sites.
Notwithstanding these limitations, there are many search options that can be passed to the Custom Search Engine to proxy what you might expect Google.com to return. In our scenario, we passed the following when making a call:
https://www.googleapis.com/customsearch/v1?key=<google_api_id>&userIp= <ip_address>&cx<custom_search_engine_id>&q=iPhone+X&cr=countryUS&start= 1</custom_search_engine_id></ip_address></google_api_id>
Where:
https://www.googleapis.com/customsearch/v1 – is the URL for the Google Custom Search API
key=<GOOGLE_API_ID> – Your Google Developer API Key
userIp=<IP_ADDRESS> – The IP address of the local machine making the call
cx=<CUSTOM_SEARCH_ENGINE_ID> – Your Google Custom Search Engine ID
q=iPhone+X – The Google query string (‘+’ replaces ‘ ‘)
cr=countryUS – Country restriction (from Goolge’s Country Collection Name list)
start=1 – The index of the first result to return – e.g. SERP page 1. Successive calls would increment this to get pages 2–5.
Google has said that the Google Custom Search engine differs from Google .com, but in my limited prod testing comparing results between the two, I was encouraged by the similarities and so continued with the analysis. That said, keep in mind that the data and results below come from Google Custom Search (using ‘whole web’ queries), not Google.com.
Using the free Moz API account
Moz provide an Application Programming Interface (API). To use it you will need to register for a Mozscape API key, which is free but limited to 2,500 rows per month and one query every ten seconds. Current paid plans give you increased quotas and start at $250/month. Having a free account and API key, you can then query the Links API and analyze the following metrics:
Moz data field
Moz API code
Description
ueid
32
The number of external equity links to the URL
uid
2048
The number of links (external, equity or nonequity or not,) to the URL
umrp**
16384
The MozRank of the URL, as a normalized 10-point score
umrr**
16384
The MozRank of the URL, as a raw score
fmrp**
32768
The MozRank of the URL's subdomain, as a normalized 10-point score
fmrr**
32768
The MozRank of the URL's subdomain, as a raw score
us
536870912
The HTTP status code recorded for this URL, if available
upa
34359738368
A normalized 100-point score representing the likelihood of a page to rank well in search engine results
pda
68719476736
A normalized 100-point score representing the likelihood of a domain to rank well in search engine results
NOTE: Since this analysis was captured, Moz documented that they have deprecated these fields. However, in testing this (15-06-2019), the fields were still present.
Moz API Codes are added together before calling the Links API with something that looks like the following:
www.apple.com%2F?Cols=103616137253&AccessID=MOZ_ACCESS_ID& Expires=1560586149&Signature=<MOZ_SECRET_KEY>
Where:
http://lsapi.seomoz.com/linkscape/url-metrics/" class="redactor-autoparser-object">http://lsapi.seomoz.com/linksc... – Is the URL for the Moz API
http%3A%2F%2Fwww.apple.com%2F – An encoded URL that we want to get data on
Cols=103616137253 – The sum of the Moz API codes from the table above
AccessID=MOZ_ACCESS_ID – An encoded version of the Moz Access ID (found in your API account)
Expires=1560586149 – A timeout for the query - set a few minutes into the future
Signature=<MOZ_SECRET_KEY> – An encoded version of the Moz Access ID (found in your API account)
Moz will return with something like the following JSON:
Array ( [ut] => Apple [uu] => <a href="http://www.apple.com/" class="redactor-autoparser-object">www.apple.com/</a> [ueid] => 13078035 [uid] => 14632963 [uu] => www.apple.com/ [ueid] => 13078035 [uid] => 14632963 [umrp] => 9 [umrr] => 0.8999999762 [fmrp] => 2.602215052 [fmrr] => 0.2602215111 [us] => 200 [upa] => 90 [pda] => 100 )
For a great starting point on querying Moz with PHP, Perl, Python, Ruby and Javascript, see this repository on Github. I chose to use PHP.
Harvesting data with PHP and MySQL
Now we have a Google Custom Search Engine and our Moz API, we’re almost ready to capture data. Google and Moz respond to requests via the JSON format and so can be queried by many popular programming languages. In addition to my chosen language, PHP, I wrote the results of both Google and Moz to a database and chose MySQL Community Edition for this. Other databases could be also used, e.g. Postgres, Oracle, Microsoft SQL Server etc. Doing so enables persistence of the data and ad-hoc analysis using SQL (Structured Query Language) as well as other languages (like R, which I will go over later). After creating database tables to hold the Google search results (with fields for rank, URL etc.) and a table to hold Moz data fields (ueid, upa, uda etc.), we’re ready to design our data harvesting plan.
Google provide a generous quota with the Custom Search Engine (up to 100M queries per day with the same Google developer console key) but the Moz free API is limited to 2,500. Though for Moz, paid for options provide between 120k and 40M rows per month depending on plans and range in cost from $250–$10,000/month. Therefore, as I’m just exploring the free option, I designed my code to harvest 125 Google queries over 2 pages of SERPs (10 results per page) allowing me to stay within the Moz 2,500 row quota. As for which searches to fire at Google, there are numerous resources to use from. I chose to use Mondovo as they provide numerous lists by category and up to 500 words per list which is ample for the experiment.
I also rolled in a few PHP helper classes alongside my own code for database I/O and HTTP.
In summary, the main PHP building blocks and sources used were:
Google Custom Search Engine – Ash Kiswany wrote an excellent article using Jacob Fogg’s PHP interface for Google Custom Search;
Mozscape API – As mentioned, this PHP implementation for accessing Moz on Github was a good starting point;
Website crawler and HTTP – At Purple Toolz, we have our own crawler called PurpleToolzBot which uses Curl for HTTP and this Simple HTML DOM Parser;
Database I/O – PHP has excellent support for MySQL which I wrapped into classes from these tutorials.
One factor to be aware of is the 10 second interval between Moz API calls. This is to prevent Moz being overloaded by free API users. To handle this in software, I wrote a "query throttler" which blocked access to the Moz API between successive calls within a timeframe. However, whilst working perfectly it meant that calling Moz 2,500 times in succession took just under 7 hours to complete.
Analyzing data with SQL and R
Data harvested. Now the fun begins!
It’s time to have a look at what we’ve got. This is sometimes called data wrangling. I use a free statistical programming language called R along with a development environment (editor) called R Studio. There are other languages such as Stata and more graphical data science tools like Tableau, but these cost and the finance director at Purple Toolz isn’t someone to cross!
I have been using R for a number of years because it’s open source and it has many third-party libraries, making it extremely versatile and appropriate for this kind of work.
Let’s roll up our sleeves.
I now have a couple of database tables with the results of my 125 search term queries across 2 pages of SERPS (i.e. 20 ranked URLs per search term). Two database tables hold the Google results and another table holds the Moz data results. To access these, we’ll need to do a database INNER JOIN which we can easily accomplish by using the RMySQL package with R. This is loaded by typing "install.packages('RMySQL')" into R’s console and including the line "library(RMySQL)" at the top of our R script.
We can then do the following to connect and get the data into an R data frame variable called "theResults."
library(RMySQL) # INNER JOIN the two tables theQuery <- " SELECT A.*, B.*, C.* FROM ( SELECT cseq_search_id FROM cse_query ) A -- Custom Search Query INNER JOIN ( SELECT cser_cseq_id, cser_rank, cser_url FROM cse_results ) B -- Custom Search Results ON A.cseq_search_id = B.cser_cseq_id INNER JOIN ( SELECT * FROM moz ) C -- Moz Data Fields ON B.cser_url = C.moz_url ; " # [1] Connect to the database # Replace USER_NAME with your database username # Replace PASSWORD with your database password # Replace MY_DB with your database name theConn <- dbConnect(dbDriver("MySQL"), user = "USER_NAME", password = "PASSWORD", dbname = "MY_DB") # [2] Query the database and hold the results theResults <- dbGetQuery(theConn, theQuery) # [3] Disconnect from the database dbDisconnect(theConn)
NOTE: I have two tables to hold the Google Custom Search Engine data. One holds data on the Google query (cse_query) and one holds results (cse_results).
We can now use R’s full range of statistical functions to begin wrangling.
Let’s start with some summaries to get a feel for the data. The process I go through is basically the same for each of the fields, so let’s illustrate and use Moz’s ‘UEID’ field (the number of external equity links to a URL). By typing the following into R I get the this:
> summary(theResults$moz_ueid) Min. 1st Qu. Median Mean 3rd Qu. Max. 0 1 20 14709 182 2755274 > quantile(theResults$moz_ueid, probs = c(1, 5, 10, 25, 50, 75, 80, 90, 95, 99, 100)/100) 1% 5% 10% 25% 50% 75% 80% 90% 95% 99% 100% 0.0 0.0 0.0 1.0 20.0 182.0 337.2 1715.2 7873.4 412283.4 2755274.0
Looking at this, you can see that the data is skewed (a lot) by the relationship of the median to the mean, which is being pulled by values in the upper quartile range (values beyond 75% of the observations). We can however, plot this as a box and whisker plot in R where each X value is the distribution of UEIDs by rank from Google Custom Search position 1-20.
Note we are using a log scale on the y-axis so that we can display the full range of values as they vary a lot!
A box and whisker plot in R of Moz’s UEID by Google rank (note: log scale)
Box and whisker plots are great as they show a lot of information in them (see the geom_boxplot function in R). The purple boxed area represents the Inter-Quartile Range (IQR) which are the values between 25% and 75% of observations. The horizontal line in each ‘box’ represents the median value (the one in the middle when ordered), whilst the lines extending from the box (called the ‘whiskers’) represent 1.5x IQR. Dots outside the whiskers are called ‘outliers’ and show where the extents of each rank’s set of observations are. Despite the log scale, we can see a noticeable pull-up from rank #10 to rank #1 in median values, indicating that the number of equity links might be a Google ranking factor. Let’s explore this further with density plots.
Density plots are a lot like distributions (histograms) but show smooth lines rather than bars for the data. Much like a histogram, a density plot’s peak shows where the data values are concentrated and can help when comparing two distributions. In the density plot below, I have split the data into two categories: (i) results that appeared on Page 1 of SERPs ranked 1-10 are in pink and; (ii) results that appeared on SERP Page 2 are in blue. I have also plotted the medians of both distributions to help illustrate the difference in results between Page 1 and Page 2.
The inference from these two density plots is that Page 1 SERP results had more external equity backlinks (UEIDs) on than Page 2 results. You can also see the median values for these two categories below which clearly shows how the value for Page 1 (38) is far greater than Page 2 (11). So we now have some numbers to base our SEO strategy for backlinks on.
# Create a factor in R according to which SERP page a result (cser_rank) is on > theResults$rankBin <- paste("Page", ceiling(theResults$cser_rank / 10)) > theResults$rankBin <- factor(theResults$rankBin) # Now report the medians by SERP page by calling ‘tapply’ > tapply(theResults$moz_ueid, theResults$rankBin, median) Page 1 Page 2 38 11
From this, we can deduce that equity backlinks (UEID) matter and if I were advising a client based on this data, I would say they should be looking to get over 38 equity-based backlinks to help them get to Page 1 of SERPs. Of course, this is a limited sample and more research, a bigger sample and other ranking factors would need to be considered, but you get the idea.
Now let’s investigate another metric that has less of a range on it than UEID and look at Moz’s UPA measure, which is the likelihood that a page will rank well in search engine results.
> summary(theResults$moz_upa) Min. 1st Qu. Median Mean 3rd Qu. Max. 1.00 33.00 41.00 41.22 50.00 81.00 > quantile(theResults$moz_upa, probs = c(1, 5, 10, 25, 50, 75, 80, 90, 95, 99, 100)/100) 1% 5% 10% 25% 50% 75% 80% 90% 95% 99% 100% 12 20 25 33 41 50 53 58 62 75 81
UPA is a number given to a URL and ranges between 0–100. The data is better behaved than the previous UEID unbounded variable having its mean and median close together making for a more ‘normal’ distribution as we can see below by plotting a histogram in R.
A histogram of Moz’s UPA score
We’ll do the same Page 1 : Page 2 split and density plot that we did before and look at the UPA score distributions when we divide the UPA data into two groups.
# Report the medians by SERP page by calling ‘tapply’ > tapply(theResults$moz_upa, theResults$rankBin, median) Page 1 Page 2 43 39
In summary, two very different distributions from two Moz API variables. But both showed differences in their scores between SERP pages and provide you with tangible values (medians) to work with and ultimately advise clients on or apply to your own SEO.
Of course, this is just a small sample and shouldn’t be taken literally. But with free resources from both Google and Moz, you can now see how you can begin to develop analytical capabilities of your own to base your assumptions on rather than accepting the norm. SEO ranking factors change all the time and having your own analytical tools to conduct your own tests and experiments on will help give you credibility and perhaps even a unique insight on something hitherto unknown.
Google provide you with a healthy free quota to obtain search results from. If you need more than the 2,500 rows/month Moz provide for free there are numerous paid-for plans you can purchase. MySQL is a free download and R is also a free package for statistical analysis (and much more).
Go explore!
Sign up for The Moz Top 10, a semimonthly mailer updating you on the top ten hottest pieces of SEO news, tips, and rad links uncovered by the Moz team. Think of it as your exclusive digest of stuff you don't have time to hunt down but want to read!
0 notes