#Like we have some example datasets where he already put all the solutions in and stuff
zarafey · 2 years
Text
Anyone possibly know how to do a meta-analysis? I mean I sure as hell have no idea what to do with the data my professor sent me because he never actually got around to showing us wtf to do with it and how to use that damned statistics software.
0 notes
srasamua · 5 years
Text
Using Python to recover SEO site traffic (Part three)
When you incorporate machine learning techniques to speed up SEO recovery, the results can be amazing.
This is the third and last installment in our series on using Python to speed up SEO traffic recovery. In part one, I explained how our unique approach, which we call “winners vs losers,” helps us quickly narrow down the pages losing traffic to find the main reason for the drop. In part two, we improved on our initial approach by manually grouping pages using regular expressions, which is very useful when you have sites with thousands or millions of pages, as is typically the case with ecommerce sites. In part three, we will learn something really exciting: we will learn to automatically group pages using machine learning.
As mentioned before, you can find the code used in parts one, two and three in this Google Colab notebook.
Let’s get started.
URL matching vs content matching
When we grouped pages manually in part two, we benefited from the fact that the URL groups had clear patterns (collections, products, and the others), but it is often the case that there are no patterns in the URL. For example, Yahoo Stores’ sites use a flat URL structure with no directory paths. Our manual approach wouldn’t work in this case.
Fortunately, it is possible to group pages by their content because most page templates have different content structures. They serve different user needs, so their structures need to differ.
How can we organize pages by their content? We can use DOM element selectors for this. We will specifically use XPaths.
For example, I can use the presence of a big product image to know the page is a product detail page. I can grab the product image address in the document (its XPath) by right-clicking on it in Chrome and choosing “Inspect,” then right-clicking to copy the XPath.
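As a rough illustration, here is how that presence check could be scripted in Python with requests and lxml (the URL and XPath below are placeholders, not values from the notebook):
import requests
from lxml import html

# Placeholder URL and XPath; substitute the XPath copied from Chrome's "Inspect" panel
url = "https://example.com/some-product-page"
product_image_xpath = '//*[@id="main-product-image"]'

response = requests.get(url)
tree = html.fromstring(response.content)

# If the XPath matches at least one element, treat the page as a product detail page
is_product_page = len(tree.xpath(product_image_xpath)) > 0
print(is_product_page)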
We can identify other page groups by finding page elements that are unique to them. However, note that while this would allow us to group Yahoo Store-type sites, it would still be a manual process to create the groups.
A scientist’s bottom-up approach
In order to group pages automatically, we need to use a statistical approach. In other words, we need to find patterns in the data that we can use to cluster similar pages together because they share similar statistics. This is a perfect problem for machine learning algorithms.
BloomReach, a digital experience platform vendor, shared their machine learning solution to this problem. To summarize it, they first manually selected cleaned features from the HTML tags like class IDs, CSS style sheet names, and the others. Then, they automatically grouped pages based on the presence and variability of these features. In their tests, they achieved around 90% accuracy, which is pretty good.
When you give problems like this to scientists and engineers with no domain expertise, they will generally come up with complicated, bottom-up solutions. The scientist will say, “Here is the data I have, let me try different computer science ideas I know until I find a good solution.”
One of the reasons I advocate practitioners learn programming is that you can start solving problems using your domain expertise and find shortcuts like the one I will share next.
Hamlet’s observation and a simpler solution
For most ecommerce sites, most page templates include images (and input elements), and those generally change in quantity and size.
I decided to test the quantity and size of images, and the number of input elements, as my feature set. We were able to achieve 97.5% accuracy in our tests. This is a much simpler and more effective approach for this specific problem. All of this is possible because I didn’t start with the data I could access, but with a simpler domain-level observation.
I am not trying to say my approach is superior, as they have tested theirs on millions of pages and I’ve only tested this on a few thousand. My point is that as a practitioner you should learn this stuff so you can contribute your own expertise and creativity.
Now let’s get to the fun part and write some machine learning code in Python!
Collecting training data
We need training data to build a model. This training data needs to come pre-labeled with “correct” answers so that the model can learn from the correct answers and make its own predictions on unseen data.
In our case, as discussed above, we’ll use our intuition that most product pages have one or more large images on the page, and most category type pages have many smaller images on the page.
What’s more, product pages typically have more form elements than category pages (for filling in quantity, color, and more).
Unfortunately, crawling a web page for this data requires knowledge of web browser automation, and image manipulation, which are outside the scope of this post. Feel free to study this GitHub gist we put together to learn more.
Here we load the raw data already collected.
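A minimal sketch of that loading step with pandas, assuming the crawl results were exported as CSV files (the file names and column layouts here are illustrative, not the notebook’s exact ones):
import pandas as pd

# Illustrative file names; the actual data lives in the Google Colab notebook
form_counts = pd.read_csv("form_counts.csv")  # one row per URL: form and input element counts
img_counts = pd.read_csv("img_counts.csv")    # one row per image: url, file size, height, width

print(form_counts.head())
print(img_counts.head())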
Feature engineering
Each row of the form_counts data frame above corresponds to a single URL and provides a count of both form elements, and input elements contained on that page.
Meanwhile, in the img_counts data frame, each row corresponds to a single image from a particular page. Each image has an associated file size, height, and width. Pages are more than likely to have multiple images on each page, and so there are many rows corresponding to each URL.
It is often the case that HTML documents don’t include explicit image dimensions, so we are using a little trick to compensate for this: we capture the size of the image files, which should be roughly proportional to the product of the width and height of the images.
We want our image counts and image file sizes to be treated as categorical features, not numerical ones. When a numerical feature, say new visitors, increases it generally implies improvement, but we don’t want bigger images to imply improvement. A common technique to achieve this is called one-hot encoding.
Most site pages can have an arbitrary number of images. We are going to further process our dataset by bucketing images into 50 groups. This technique is called “binning”.
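A minimal sketch of the binning and one-hot encoding steps with pandas, assuming the img_counts data frame described above (the url and img_size column names are assumptions):
import pandas as pd

# Bin each image's file size into one of 50 buckets ("binning")
img_counts["size_bin"] = pd.cut(img_counts["img_size"], bins=50, labels=False)

# One-hot encode the bucket label, then aggregate per URL so each page ends up
# with a count of images falling into each size bucket
size_dummies = pd.get_dummies(img_counts["size_bin"], prefix="size_bin")
img_features = pd.concat([img_counts[["url"]], size_dummies], axis=1).groupby("url").sum()

print(img_features.head())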
Here is what our processed data set looks like.
Adding ground truth labels
As we already have correct labels from our manual regex approach, we can use them as the ground truth labels to feed the model.
We also need to split our dataset randomly into a training set and a test set. This allows us to train the machine learning model on one set of data and test it on another set that it has never seen before. We do this to prevent our model from simply “memorizing” the training data and doing terribly on new, unseen data.
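A minimal sketch of that split with scikit-learn, assuming a feature matrix X (the engineered features per URL) and a label vector y (the page groups from the regex approach) have already been assembled:
from sklearn.model_selection import train_test_split

# Hold out 30% of the pages as unseen test data; stratify so every page group
# is represented in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)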
Model training and grid search
Finally, the good stuff!
All the steps above, the data collection and preparation, are generally the hardest part to code. The machine learning code is generally quite simple.
We’re using the well-known scikit-learn Python library to train a number of popular models using a bunch of standard hyperparameters (settings for fine-tuning a model). Scikit-learn will run through all of them to find the best one. We simply need to feed the X variables (our engineered features from above) and the Y variables (the correct labels) into each model, call the .fit() function, and voila!
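As a rough sketch of that model comparison (the candidate models and hyperparameter grids below are illustrative choices rather than the notebook’s exact ones):
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

candidates = {
    "linear_svm": (LinearSVC(max_iter=10000), {"C": [0.01, 0.1, 1, 10]}),
    "logistic_regression": (LogisticRegression(max_iter=1000), {"C": [0.01, 0.1, 1, 10]}),
}

best_models = {}
for name, (model, param_grid) in candidates.items():
    search = GridSearchCV(model, param_grid, cv=5)  # try every hyperparameter combination
    search.fit(X_train, y_train)                    # the training split from above
    best_models[name] = search
    print(name, search.best_score_, search.best_params_)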
Evaluating performance
After running the grid search, we find our winning model to be the linear SVM (0.974), with logistic regression (0.968) coming in a close second. Even with such high accuracy, a machine learning model will still make mistakes. If it doesn’t make any mistakes, then there is definitely something wrong with the code.
In order to understand where the model performs best and worst, we will use another useful machine learning tool, the confusion matrix.
When looking at a confusion matrix, focus on the diagonal squares. The counts there are correct predictions, and the counts outside the diagonal are failures. In the confusion matrix above, we can quickly see that the model does really well labeling products, but terribly at labeling pages that are neither products nor categories. Intuitively, we can assume that such pages do not have consistent image usage.
The code to put together the confusion matrix is in the Colab notebook linked above. A minimal sketch of that step, assuming the winning model and the test split from the previous sections, might look like this:
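from sklearn.metrics import confusion_matrix

# Predict page groups for the held-out test set with the winning model
y_pred = best_models["linear_svm"].predict(X_test)

# Rows are the true page groups and columns the predicted ones;
# the diagonal holds the correct predictions
cm = confusion_matrix(y_test, y_pred)
print(cm)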
Finally, the code to plot the model evaluation is also in the notebook. One common way to plot the confusion matrix, assuming the cm array computed above, is sketched here:
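import matplotlib.pyplot as plt
import seaborn as sns

group_names = sorted(y_test.unique())  # page-group labels, sorted to match confusion_matrix's default ordering
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt="d", xticklabels=group_names, yticklabels=group_names)
plt.xlabel("Predicted group")
plt.ylabel("True group")
plt.title("Confusion matrix for page-group classification")
plt.show()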
Resources to learn more
You might be thinking that this is a lot of work just to tell page groups apart, and you are right!
Mirko Obkircher commented in my article for part two that there is a much simpler approach, which is to have your client set up a Google Analytics data layer with the page group type. Very smart recommendation, Mirko!
I am using this example for illustration purposes. What if the issue requires a deeper exploratory investigation? If you already started the analysis using Python, your creativity and knowledge are the only limits.
If you want to jump onto the machine learning bandwagon, here are some resources I recommend to learn more:
Attend a PyData event. I got motivated to learn data science after attending the event they host in New York.
Hands-On Introduction To Scikit-learn (sklearn)
Scikit Learn Cheat Sheet
Efficiently Searching Optimal Tuning Parameters
If you are starting from scratch and want to learn fast, I’ve heard good things about Data Camp.
Got any tips or queries? Share it in the comments.
Hamlet Batista is the CEO and founder of RankSense, an agile SEO platform for online retailers and manufacturers. He can be found on Twitter @hamletbatista.
The post Using Python to recover SEO site traffic (Part three) appeared first on Search Engine Watch.
from Digital Marketing News https://searchenginewatch.com/2019/04/17/using-python-to-recover-seo-site-traffic-part-three/
2 notes
analyticsindiam · 5 years
Text
Intel Readies For An AI Revolution With A Comprehensive AI Solutions Stack
Global technology player Intel has been a catalyst for some of the most significant technology transformations in the last 50 years, preparing its partners, customers and enterprise users for a digital era. In the area of artificial intelligence (AI) and deep learning (DL), Intel is at the forefront of providing end-to-end solutions that are creating immense business value.
But there’s one more area where the technology giant is playing a central role. Intel is going to the heart of the developer community by providing a wealth of software and developer tools that can simplify building and deployment of DL-driven solutions and take care of all computing requirements, so that data scientists, machine learning engineers and practitioners can focus on delivering solutions that grant real business value. The company’s software offerings provide a range of options to meet the varying needs of data scientists, developers and researchers at various levels of AI expertise.
So, why are AI software development tools more important now than ever? As architectural diversity increases and the compute environment becomes more sophisticated, the developer community needs access to a comprehensive suite of tools that can enable them to build applications better, faster and more easily and reliably without worrying about the underlying architecture. What Intel is primarily doing is empowering coders, data scientists and researchers to become more productive by taking away the code complexity.
Intel Makes AI More Accessible For The Developer Community
In more ways than one, software has become the last mile between the developers and the underlying hardware infrastructure, enabling them to utilise the optimization capabilities of processors. Analytics India Magazine spoke to Akanksha Bilani, Country Lead – India, Singapore, ANZ at Intel Software to understand why, in today’s world, transformation of software is key to driving effective business, usage models and market opportunity.
“Gone are the days where adding more racks to existing platforms helped drive productivity. Moore’s law and AI advocates that the way to take advantage of hardware is by driving innovation on software that runs on top of it. Studies show that modernization, parallelisation and optimization of software on the hardware helps in doubling the performance of our hardware,” she emphasizes.
Going forward, the convergence of architecture innovation and optimized software for platforms will be the only way to harness the potential of future paradigms of AI, High Performance Computing (HPC) and the Internet of Everything (IoE). Intel’s Naveen Rao, Corporate Vice President and General Manager, Artificial Intelligence Products Group at Intel Corporation, summed up the above statement at the recently concluded AI Hardware1 summit. It’s not just a ‘fast chip’ - but a portfolio of products with a software roadmap that can enable the developer community to leverage the capabilities of the new AI hardware. “AI models are growing by 2x every 3 months. So it will take a village of technologies to meet the demands: 2x by software, 2x by architecture, 2x by silicon process and 4x by interconnect,” he stated.
Simplifying AI Workflows With Intel® Software Development Tools
As the global technology major leads the way forward in data-driven transformation, we are seeing Intel® Software2 solutions open up a new set of possibilities across multiple sectors.
In retail, the Intel® Distribution of OpenVINO™ Toolkit is helping business leaders3 take advantage of near real-time insights to help make better decisions faster. Wipro4 has built groundbreaking edge AI solutions on server class Intel® Xeon® Scalable Processors and the Intel® Distribution of OpenVINO™ Toolkit.
Today, data scientists who are building cutting-edge AI algorithms rely very heavily on Intel® Distribution for Python to get higher performance gains. While stock Python products bring a great deal of performance to the table, the Intel performance libraries that come already plugged in with Intel® Distribution for Python help programs gain more significant speed-ups as compared to the open source scikit-learn.
Now, those working in distributed environments leverage BigDL, a DL library for Apache Spark. This distributed DL library helps data scientists accelerate DL inference on CPUs in their Spark environment. “BigDL is an add-on to the machine learning pipeline and delivers an incredible amount of performance gains,” Bilani elaborates.
Then there’s also Intel® Data Analytics Acceleration Library (Intel® DAAL), widely used by data scientists for its range of algorithms, ranging from the most basic descriptive statistics for datasets to more advanced data mining and machine learning algorithms. For every stage in the development pipeline, there are tools providing APIs, and it can be used with other popular data platforms such as Hadoop, Matlab, Spark and R.
There is also another audience that Intel caters to — the tuning experts who really understand their programs and want to get the maximum performance out of their architecture. For these users, the company offers its Intel Math Kernel Library for Deep Neural Networks (Intel MKL-DNN) — an open source, performance-enhancing library which has been abstracted to a great extent to allow developers to utilise DL frameworks featuring optimized performance on Intel hardware. This platform can accelerate DL frameworks on Intel architecture, and developers can also learn more about this tool through tutorials.
The developer community is also excited about yet another ambitious undertaking from Intel, which will soon be out in beta and that truly takes away the complexity brought on by heterogeneous architectures. OneAPI, one of the most ground-breaking multi-year software projects from Intel, offers a single programming methodology across heterogeneous architectures. The end benefit to application developers is that they need no longer maintain separate code bases, multiple programming languages, and different tools and workflows, which means that they can now get maximum performance out of their hardware.
As Prakash Mallya, Vice President and Managing Director, Sales and Marketing Group, Intel India, explains, “The magic of OneAPI is that it takes away the complexity of the programme and developers can take advantage of the heterogeneity of architectures which implies they can use the architecture that best fits their usage model or use case. It is an ambitious multi-year project and we are committed to working through it every single day to ensure we simplify and not compromise our performance.”
According to Bilani, the bottom line of leveraging OneAPI is that it provides an abstracted, unified programming language that actually delivers one view/OneAPI across all the various architectures. OneAPI will be out in beta in October.
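As a rough illustration of what this kind of drop-in acceleration looks like in code, here is a sketch using the open-source scikit-learn-intelex package (a later open-source packaging of these optimizations, used here purely as an example rather than the exact distribution described above):
# scikit-learn-intelex patches scikit-learn so that supported estimators run on
# Intel's oneDAL-accelerated implementations
from sklearnex import patch_sklearn
patch_sklearn()

# After patching, ordinary scikit-learn code runs on the accelerated back end
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(100_000, 20)
kmeans = KMeans(n_clusters=8, random_state=0).fit(X)
print(kmeans.inertia_)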
How Intel Is Reimagining Computing
As architectures get more diverse, Intel is doubling down on a broader roadmap for domain-specific architectures coupled with simplified software tools (libraries and frameworks) that enable abstraction and faster prototyping across its comprehensive AI solutions stack. The company is also scaling adoption of its hardware assets — CPUs, FPGAs, VPUs and the soon to be released Intel Nervana™ Neural Network Processor product line. As Mallya puts it, “Hardware is foundational to our company. We have been building architectures for the last 50 years and we are committed to doing that in the future but if there is one thing I would like to reinforce, it is that in an AI-driven world, as data-centric workloads become more diverse, there’s no single architecture that can fit in.”
That’s why Intel focuses on multiple architectures — whether it is scalar (CPU), vector (GPU), matrix (AI) or spatial (FPGA). The Intel team is working towards offering more synchrony between all the hardware layers and software. For example, Intel Xeon Scalable processors have undergone generational improvements and are now seeing a drift towards instructions which are very specific to AI. Vector Neural Network Instruction (VNNI), built into the 2nd Generation Intel Xeon Scalable processors, delivers enhanced AI performance. Advanced Vector Extensions (AVX), on the other hand, are instructions that have already been a part of Intel Xeon technology for the last five years. While AVX allows engineers to get the performance they need on a Xeon processor, VNNI enables data scientists and machine learning engineers to maximize AI performance.
Here’s where Intel is upping the game in terms of heterogeneity — from generic CPUs (2nd Gen Intel Xeon Scalable processors) running specific instructions for AI to actually having a complete product built for both training and inference. Earlier in August at Hot Chips 2019, Intel announced the Intel Nervana Neural Network processors4, designed from the ground up to run full AI workloads that cannot run on GPUs, which are more general purpose.
The bottom line:
a) Deploy AI anywhere with unprecedented hardware choice
b) Software capabilities that sit on top of hardware
c) Enriching community support to get up to speed with the latest tools
Winning the AI Race
For Intel, the winning factor has been staying closely aligned with its strategy of a ‘no one size fits all’ approach and ensuring its evolving portfolio of solutions and products stays AI-relevant. The technology behemoth has been at the forefront of the AI revolution, helping enterprises and startups operationalize AI by reimagining computing and offering full-stack AI solutions, spanning software and hardware, that add additional value to end customers. Intel has also heavily built up a complete ecosystem of partnerships and has made significant inroads into specific industry verticals and applications like telecom, healthcare and retail, helping the company drive long-term growth. As Mallya sums up, the way forward is through meaningful collaborations and making the vision of AI for India a reality using powerful best-in-class tools.
Sources
1. AI Hardware Summit: https://twitter.com/karlfreund
2. Intel Software Solutions: https://software.intel.com/en-us
3. Accelerate Vision Anywhere With OpenVINO™ Toolkit: https://www.intel.in/content/www/in/en/internet-of-things/openvino-toolkit.html
4. At Hot Chips, Intel Pushes ‘AI Everywhere’: https://newsroom.intel.com/news/hot-chips-2019/#gs.8w7pme
0 notes
toldnews-blog · 6 years
Photo
New Post has been published on https://toldnews.com/business/virtual-cities-designing-the-metropolises-of-the-future/
Virtual cities: Designing the metropolises of the future
Image copyright Getty Images
Image caption The cities of the future will be informed by data as much as by design
Simulation software that can create accurate “digital twins” of entire cities is enabling planners, designers and engineers to improve their designs and measure the effect changes will have on the lives of citizens.
Cities are hugely complex and dynamic creations. They live and breathe.
Think about all the parts: millions of people, schools, offices, shops, parks, utilities, hospitals, homes and transport systems.
Changing one aspect affects many others. Which is why planning is such a hard job.
So imagine having a tool at your disposal that could answer questions such as “What will happen to pedestrian and traffic flow if we put the new metro station here?” or “How can we persuade more people to leave their cars at home when they go to work?”
This is where 3D simulation software is coming into its own.
Architects, engineers, construction companies and city planners have long used computer-aided design and building information modelling software to help them create, plan and construct their projects.
But with the addition of internet of things (IoT) sensors, big data and cloud computing, they can now create “digital twins” of entire cities and simulate how things will look and behave in a wide range of scenarios.
“A digital twin is a virtual representation of physical buildings and assets but connected to all the data and information around those assets, so that machine learning and AI algorithms can be applied to them to help them operate more efficiently,” explains Michael Jansen, chief executive of Cityzenith, the firm behind the Smart World Pro simulation platform.
Take Singapore as an example.
Image copyright NRF Singapore
Image caption The real Singapore has been faithfully recreated in virtual form
Image copyright NRF Singapore
Image caption Planners now have a data-rich simulation of the city to interact with
This island state, sitting at the foot of the Malaysian peninsula with a population of six million people, has developed a virtual digital twin of the entire city using software developed by French firm Dassault Systemes.
“Virtual Singapore is a 3D digital twin of Singapore built on topographical as well as real-time, dynamic data,” explains George Loh, programmes director for the city’s National Research Foundation (NRF), a department within the prime minister’s office.
“It will be the country’s authoritative platform that can be used by urban planners to simulate the testing of innovative solutions in a virtual environment.”
In addition to the usual map and terrain data, the platform incorporates real-time traffic, demographic and climate information, says Mr Loh, giving planners the ability to engage in “virtual experimentation”.
“For example, we can plan barrier-free routes for disabled and elderly people,” he says.
Bernard Charles, Dassault Systemes’ chief executive, says the addition of real-time data from multiple sources facilitates joined-up, holistic thinking.
Image copyright NRF Singapore
Image caption The city envisages Virtual Singapore being used by citizens to locate driverless cars for hire
“The problem is that when we decide about the evolution of a city we are in some way blind. You have the urban view of it – a map – you decide to put a building here, but another agency has to think about transport, another agency has to think about commercial use and flats for people.
“The creation of one thing changes so many other things – the flow and life of citizens.”
The firm’s 3DExperience platform gives planners and designers “a global overview” they’ve never had before, explains Mr Charles.
Dassault’s software, which incorporates calculations that simulate the flow of a fluid, is used to design most F1 cars and aeroplanes, says Mr Charles, and this capability is useful for understanding wind flow around buildings, through streets and green spaces.
Image copyright NRF
Image caption The software can model wind flow through built up areas
“If some parts of a city are too windy and cold, no-one will like to go there,” he says.
Tracking people’s movements through a city using anonymised mobile phone and transport GPS data can help authorities spot bottlenecks and heat maps as the day progresses, hopefully leading to smarter, more integrated transport and traffic management systems.
“You can look at all ‘what if’ scenarios, so if we ask the right question we can change the city, the world,” concludes Mr Charles.
Is India failing to build its newest state capital?
In the state of Andhra Pradesh in India, a brand new $6.5bn “smart city” called Amaravati has been planned since 2015, but has been mired in controversy amid disagreements over the designs and criticism of its environmental impact.
But last year Foster + Partners, the global architecture and engineering firm, and Surbana Jurong, the Asian urban and infrastructure consultancy, were chosen to take on the huge task.
And Chicago-based Cityzenith is providing the single “command and control” digital platform for the entire project.
Image copyright Cityzenith
Image caption Cityzenith’s Smart World Pro platform gives a real-time simulation of the entire Amaravati city project
IoT sensors will monitor construction progress in real time, says Mr Jansen, and the software will integrate all the designs from the 30 or so design consultants already involved in the first phase of the project.
“The portal will simulate the impact of these proposed buildings before anyone even breaks ground,” he says, “and these simulations will adjust to real-time changes.”
The platform can incorporate more than a thousand datasets, says Mr Jansen, and integrate all the various design and planning tools the designers and contractors use.
The city, which will eventually be home to 3.5 million people, will be hot and humid, experiencing temperatures approaching 50C at times, so simulating how buildings will cope with the climate will be crucial, says Mr Jansen.
One large Norwegian engineering consultancy, Norconsult, is even combining simulation software with gaming to help improve its designs.
When working on a large rail tunnel project in Norway, the firm developed a virtual reality game to involve train drivers in the design of the signalling system. The drivers operated a virtual train and “drove” it through the tunnel, flagging up any issues with the proposed position of the signals.
Image copyright Norconsult
Image caption Train drivers “drove” a virtual train through the tunnel to test the positions of the signals
“They could change weather conditions, the speed and so on,” says Thomas Angeltveit, who worked on the project. “It feels real, so it is much easier for them to interact.”
“We had a lot of comments, so we were able to change the design and make a lot of adjustments.”
Changing the design before construction begins obviously saves money in the long-term.
Digital twin simulation software is a fast-growing business, with firms such as Siemens, Microsoft and GE joining Dassault Systemes and Cityzenith as lead practitioners.
Research firm Gartner predicts that by 2021 half of large industrial companies will use digital twins and estimates that those that do could save up to 25% in operational running costs as a result.
The future of design is virtual and driven by data it seems.
Follow Matthew on Twitter and Facebook
0 notes
ladystylestores · 4 years
Text
A Silicon Valley for everyone – TechCrunch
Editor’s note: Get this free weekly recap of TechCrunch news that any startup can use by email every Saturday morning (7am PT). Subscribe here.
Many in the tech industry saw the threat of the novel coronavirus early and reacted correctly. Fewer have seemed prepared for its aftereffects, like the outflow of talented employees from very pricey office real estate in expensive and troubled cities like San Francisco.
And few indeed have seemed prepared for the Black Lives Matter protests that have followed the death of George Floyd. This was maybe the easiest to see coming, though, given how visible the structural racism is in cities up and down the main corridors of Silicon Valley.
Today, the combination of politics, the pandemic and the protests feels almost like a market crash for the industry (except many revenues keep going up and to the right). Most every company is now fundamentally reconsidering where it will be located and who it will be hiring — no matter how well it is doing otherwise.
Some, like Google and Thumbtack, have been caught in the awkward position of scaling back diversity efforts as part of pandemic cuts right before making statements in support of the protesters, as Megan Rose Dickey covered on TechCrunch this week. But it is also the pandemic helping to create the focus, as Arlan Hamilton of Backstage Capital tells her:
It is like the world and the country has a front-row seat to what Black people have to witness, take in, and feel all the time. And it was before they were seeing some of it, but they were seeing it kind of protected by us. We were kind of shielding them from some of it… It’s like a VR headset that the country is forced to be in because of COVID. It’s just in their face.
This is also putting new scrutiny on how tech is used in policing today. It is renewing questions around who gets to be a VC and who gets funding, right when the industry is under new pressure to deliver. It is highlighting changes that companies can make internally, like this list from BLCK VC on Extra Crunch.
As with police reforms currently in the national debate, some of the most promising solutions are local. Property tax reform, pro-housing activism and sustainable funding for homelessness services are direct ways for the tech industry to address the long history of discrimination where the modern tech industry began, Catherine Bracy of TechEquity writes for TechCrunch. These changes are also what many think would make the Bay Area a more livable place for everyone, including any startup and any tech employee at any tech company (see: How Burrowing Owls Lead To Vomiting Anarchists).
Something to think about as we move on to our next topic — the ongoing wave of tech departures from SF.
Where will VCs follow founders to now?
In this week’s staff survey, we revisit the remote-first dislocation of the tech industry’s core hubs. Danny Crichton observes some of the places that VCs have been leaving town for, and thinks it means bigger changes are underway:
“Are VCs leaving San Francisco? Based on everything I have heard: yes. They are leaving for Napa, leaving for Tahoe, and otherwise heading out to wherever gorgeous outdoor beauty exists in California. That bodes ill for San Francisco’s (and really, South Park’s) future as the oasis of VC.
But the centripetal forces are strong. VCs will congregate again somewhere else, because they continue to have that same need for market intelligence that they have always had. The new, new place might not be San Francisco, but I would be shocked just given the human migration pattern underway that it isn’t in some outlying part of the Bay Area.
And then he says this:
As for VCs — if the new central node is a bar in Napa and that’s the new “place to be” — that could be relatively more permanent. Yet ultimately, VCs follow the founders even if it takes time for them to recognize the new balance of power. It took years for most VCs to recognize that founders didn’t want to work in South Bay, but now nearly every venture firm of note has an office in San Francisco. Where the founders go, the VCs will follow. If that continues to be SF, its future as a startup hub will continue after a brief hiatus.
It’s true that another outlying farming community in the region once became a startup hub, but that one had a major research university next door, and at the time a lot of cheap housing if you were allowed access to it. But Napa cannot be the next Palo Alto because it is fully formed today as a glorified retirement community, Danny.
I’m already on the record for saying that college towns in general are going to become more prominent in the tech world, between ongoing funding for innovative tech work and ongoing desirability for anyone moving from the big cities. But I’m going to add a side bet that cities will come back into fashion with the sorts of startup founders that VCs would like to back. As Exhibit A, I’d like to present Jack Dorsey, who started a courier dispatch in Oakland in 2000, and studied fashion and massage therapy during the aftermath of the dot-com bubble. His success with Twitter a few years later in San Francisco inspired many founders to move as well.
Creative people like him are drawn to the big, creative environments that cities can offer, regardless of what the business establishment thinks. If the public and private sectors can learn from the many mistakes of recent decades (see last item) who knows, maybe we’ll see a more equal and resilient sort of boom emerge in tech’s current core.
Insurance provider Lemonade files for IPO with that refreshing common-stock flavor
There are probably some amazing puns to be made here but it has been a long week, and the numbers speak for themselves. Lemonade sells insurance to renters and homeowners online, and managed to reach a private valuation of $3.5 billion before filing to go public on Monday — with the common stockholders still comprising the majority of the cap table.
Danny crunched the numbers from the S-1 on Extra Crunch to generate the table, included, that illustrates this rather unusual breakdown. Usually, as you almost certainly know already, the investors own well over half by the time of a good liquidity event. “So what was the magic with Lemonade?” he ponders. “One piece of the puzzle is that company founder Daniel Schreiber was a multi-time operator, having previously built Powermat Technologies as the company’s president. The other piece is that Lemonade is built in the insurance market, which can be carefully modeled financially and gives investors a rare repeatable business model to evaluate.”
(Photo by Paul Hennessy/NurPhoto via Getty Images)
Adapting enterprise product roadmaps to the pandemic
Our investor surveys for Extra Crunch this week covered the space industry’s startup opportunities, and looked at how enterprise investors are assessing the impact of the pandemic. Here’s Theresia Gouw of Acrew Capital, explaining how two of their portfolio companies have refocused in recent months:
A common theme we found when joining our founders for these strategy sessions was that many pulled forward and prioritized mid- to long-term projects where the product features might better fit the needs of their customers during these times. One such example in our portfolio is Petabyte’s (whose product is called Rhapsody) accelerated development of its software capabilities that enable veterinarians to provide telehealth services. Rhapsody has also incorporated key features that enable a contactless experience when telehealth isn’t sufficient. These include functionality that enables customers to check-in (virtual waiting room), sign documents, and make payments from the comfort and safety of their car when bringing their pet (the patient!) to the vet for an in-person check-up.
Another such example would be PredictHQ, which provides demand intelligence to enterprises in travel, hospitality, logistics, CPG, and retail, all sectors who saw significant change (either positive or negative) in the demand for their products and services. PredictHQ has the most robust global dataset on real-world events. Pandemics and all the ensuing restrictions and, then, loosening of restrictions fall within the category of real-world events. The company, which also has multiple global offices, was able to incorporate the dynamic COVID government responses on a hyperlocal basis, by geography, and equip its customers (e.g., Domino’s, Qantas, and First Data) with up to date insights that would help with demand planning and forecasting as well as understanding staffing needs.
Around TechCrunch
Extra Crunch Live: Join Superhuman CEO Rahul Vohra for a live Q&A on June 16 at 2pm EDT/11 AM PDT
Join us for a live Q&A with Plaid CEO Zach Perret June 18 at 10 a.m. PDT/1 p.m. EDT
Two weeks left to save on TC Early Stage passes
Learn how to ‘nail it before you scale it’ with Floodgate’s Ann Miura-Ko at TC Early Stage SF
How can startups reinvent real estate? Learn how at TechCrunch Disrupt
Stand out from the crowd: Apply to TC Top Picks at Disrupt 2020
Across the Week
TechCrunch
Theaters are ready to reopen, but is America ready to go back to the movies?
Edtech is surging, and parents have some notes
When it comes to social media moderation, reach matters
Zoom admits to shutting down activist accounts at the request of the Chinese government
Extra Crunch
TechCrunch’s top 10 picks from Techstars’ May virtual demo days
Software’s meteoric rise: Have VCs gone too far?
Recession-proof your software engineering career
The complicated calculus of taking Facebook’s venture money
The pace of startup layoffs may be slowing down
#EquityPod
From Alex:
Hello and welcome back to Equity, TechCrunch’s venture capital-focused podcast, where we unpack the numbers behind the headlines.
After a pretty busy week on the show we’re here with our regular Friday episode, which means lots of venture rounds and new venture capital funds to dig into. Thankfully we had our full contingent on hand: Danny “Well, you see” Crichton, Natasha “Talk to me post-pandemic” Mascarenhas, Alex “Very shouty” Wilhelm and, behind the scenes, Chris “The Dad” Gates.
Make sure to check out our IPO-focused Equity Shot from earlier this week if you haven’t yet, and let’s get into today’s topics:
Instacart raises $225 million. This round, not unexpected, values the on-demand grocery delivery startup at $13.7 billion — a huge sum, and one that should make it harder for the well-known company to sell itself to anyone but the public markets. Regardless, COVID-19 gave this company a huge updraft, and it capitalized on it.
Pando raises $8.5 million. We often cover rounds on Equity that are a little obvious. SaaS, that sort of thing. Pando is not that. Instead, it’s a company that wants to let small groups of individual pool their upside and allow for more equal outcomes in an economy that rewards outsized success.
Ethena raises $2 million. Anti-harassment software is about as much fun as the dentist today, but perhaps that doesn’t have to be the case. Natasha talked us through the company, and its pricing. I’m pretty bullish on Ethena, frankly. Homebrew, Village Global and GSV took part in the financing event.
Vendr raises $4 million. Vendr wants to help companies cut their SaaS bills, through its own SaaS-esque product. I tried to explain this, but may have butchered it a bit. It’s cool, I promise.
Facebook is getting into the CVC game. This should not be a surprise, but we were also not sure who was going to want Facebook money.
And, finally, Collab Capital is raising a $50 million fund to invest in Black founders. Per our reporting, the company is on track to close on $10 million in August. How fast the fund can close its full target is something we’re going to keep an eye on, considering it might get a lot harder a lot sooner. 
And that is that; thanks for lending us your ears.
Equity drops every Friday at 6:00 am PT, so subscribe to us on Apple Podcasts, Overcast, Spotify and all the casts.
Source link
from World Wide News https://ift.tt/2XXCWXM
0 notes
hudsonespie · 5 years
Text
Study: Around the World, Sustainable Fisheries Management is Working
Fisheries around the world are in better health than most people realize, according to a new study published last month in the journal PNAS. The study is the latest comprehensive health assessment of the world’s fish populations, and the data paints an improving picture, with many fisheries now able to provide a sustainable catch.
“There is a narrative that fish stocks are declining around the world, that fisheries management is failing, and we need new solutions. And it’s totally wrong,” said lead author Ray Hilborn, a fisheries expert at the University of Washington, who led the study. “Fish stocks are not all declining around the world. They are increasing in many places, and we already know how to solve problems through effective fisheries management.”
Key fishing grounds in Europe, South America and Africa are among those found to have healthy or improving numbers. But the good news has limits. The status of many unmanaged fisheries, especially those in South and Southeast Asia, are unclear. As global trade continues to increase demand, these regions are most likely being overexploited.
Compiled by fisheries scientists from around the world, the new analysis looked at data on 882 fish stocks, including information available for the first time about catches from Peru, Chile, Japan, Russia, north-west Africa and the Mediterranean and Black seas. The researchers then compared this to details of fisheries management in about 30 countries. They found that more intense management led to healthy or improving fish populations, while little to no management led to overfishing.
The study concludes: “The efforts of the thousands of managers, scientists, fishers and non-governmental organization workers have resulted in significantly improved statuses of fisheries in much of the developed world, and increasingly in the developing world.”
The study shows something else too: consensus and cooperation continues between two distinct camps of fisheries experts previously in conflict. The two sides – who disagreed on the health and likely future prospects of global fish populations – first combined to offer a joint assessment a decade ago. That 2009 analysis concluded many depleted fisheries were making good progress towards recovery. But the data used only covered about 20% of the world’s catch. In other words, the status of 80 percent of the fish landed every year across the globe remained a mystery.
Last week’s study is put together by a similar team of researchers and significantly extends the dataset, which now contains information on about half the world’s catch. The results, Hilborn says, show that consumers in the developed world – including in North America and Britain – can now buy many fish species with a clear conscience. “If you want to be very careful, you need to look at exactly what species it is,” he says. “But as a general rule, particularly those of us in the West, we’re largely eating fish that come from well-managed fisheries.”
There are some important exceptions. For example, shrimp is the most popular seafood in the US, and the majority is imported from unmanaged fisheries in Southeast Asia.
“Many of the countries that have made progress domestically still import from countries where the situation isn’t as nice. That is something else we should be conscious of,” says Beth Fulton, a marine scientist with the Commonwealth Scientific and Industrial Research Organisation (CSIRO) in Hobart, Australia, who was not involved with the new study.
She adds: “Serious effort has to go into helping nations which do not currently have significant fisheries management capacity to tackle the issues they face, which go beyond a lack of resources.”
Many important fisheries are not included in the new dataset, sometimes because dozens of different species of fish are caught at the same time. That type of fishing activity is more difficult to track as management schemes typically focus on fisheries where a single species is targeted, such as cod or tuna.
Hilborn says: “The unassessed fisheries are largely highly mixed fisheries. They may catch a hundred species in one haul of the net, and you can’t regulate those on a species-by-species basis. So the toolkit for managing those fisheries is going to be different than what we dominantly use in the successes we’ve had so far.”
Unassessed fisheries in India, Indonesia and China represent 30-40% of the world’s fish catch. “China is a big black box. It’s the biggest fishing country in the world. And they have essentially no publicly available assessments of their resources,” Hilborn says.
Steve Palumbi, a fisheries scientist at Stanford University, says some caution is also needed with the data where they do exist. “I’m not as convinced that this shows the universal success of fisheries management schemes,” he says. “Because regional data from the same countries – mostly the US and Canada – show different patterns.” East coast fisheries in both countries have not responded well, whereas west coast fisheries, and Alaska have done better. “It may well be that there has not been enough time for the effect of management to take hold in the eastern fisheries, perhaps because they were so far down to begin with,” he says.
Reg Watson, a marine researcher at the University of Tasmania, says scientists tend to think about fisheries in two distinct ways. “One tries to save the oceans and all its life from the destruction of fishing. While the other tries to focus on the stocks that feed us and provide jobs and support to the millions around the world,” he says. “The typical uncertainty associated with grand assessments of the world’s ocean life leave room for both.”
Focusing on fish stocks might show that a fishery can provide a sustainable supply, he says, but such data don’t necessarily offer a true picture of the health of a marine ecosystem – “like our terrestrial systems they have likely been greatly simplified and now lack much of the diversity and resilience they once had.” Watson adds: “This could be very important in the near future.”
David Adam is a freelance journalist based near London. This article appears courtesy of China Dialogue Ocean and may be found in its original form here. 
from Storage Containers https://maritime-executive.com/article/study-around-the-world-sustainable-fisheries-management-is-working via http://www.rssmix.com/
0 notes
shuga-hill · 5 years
Link
Making Algorithms More Like Kids: What Can Four-Year-Olds Do That AI Can’t?
Thomas Hornigold
Jun 26, 2019
Instead of trying to produce a programme to simulate the adult mind, why not rather try to produce one which simulates the child’s? If this were then subjected to an appropriate course of education one would obtain the adult brain.
Alan Turing famously wrote this in his groundbreaking 1950 paper Computing Machinery and Intelligence, and laid the framework for generations of machine learning scientists to follow. Yet, despite increasingly impressive specialized applications and breathless predictions, we’re still some distance from programs that can simulate any mind, even one much less complex than a human’s.
Perhaps the key came in what Turing said next: “Our hope is that there is so little mechanism in the child brain that something like it can be easily programmed.” This seems, in hindsight, naive. Moravec’s paradox applies: things that seem like the height of human intellect, like a good stimulating game of chess, are easy for machines, while simple tasks can be extremely difficult. But if children are our template for the simplest general human-level intelligence we might program, then surely it makes sense for AI researchers to study the many millions of existing examples.
This is precisely what Professor Alison Gopnik and her team at Berkeley do. They seek to answer the question: how sophisticated are children as learners? Where are children still outperforming the best algorithms, and how do they do it?
General, Unsupervised Learning
Some of the answers were outlined in a recent talk at the International Conference on Machine Learning. The first and most obvious difference between four-year-olds and our best algorithms is that children are extremely good at generalizing from a small set of examples. ML algorithms are the opposite: they can extract structure from huge datasets that no human could ever process, but generally large amounts of training data are needed for good performance.
This training data usually has to be labeled, although unsupervised learning approaches are also making progress. In other words, there is often a strong “supervisory signal” coded into the algorithm and its dataset, consistently reinforcing the algorithm as it improves. Children can learn to perform generally on a wide variety of tasks with very little supervision, and they can generalize what they’ve learned to new situations they’ve never seen before.
Even in image recognition, where ML has made great strides, algorithms require a large set of images before they can confidently distinguish objects; children may only need one. How is this achieved?
Professor Gopnik and others argue that children have “abstract generative models” that explain how the world works. In other words, children have imagination: they can ask themselves abstract questions like “If I touch this sharp pin, what will happen?” And then, from very small datasets and experiences, they can anticipate the solution.
In doing so, they are correctly inferring the relationship between cause and effect from experience. Children know that the reason that this object will prick them unless handled with care is because it’s pointy, and not because it’s silver or because they found it in the kitchen. This may sound like common sense, but being able to make this kind of causal inference from small datasets is still hard for algorithms to do, especially across such a wide range of situations.
The Power of Imagination
Generative models are increasingly being employed by AI researchers—after all, the best way to show that you understand the structure and rules of a dataset is to produce examples that obey those rules. Such neural networks can compress hundreds of gigabytes of image data into hundreds of megabytes of statistical parameter weights and learn to produce images that look like the dataset. In this way, they “learn” something of the statistics of how the world works. But to do what children can and generalize with generative models is computationally infeasible, according to Gopnik.
This is far from the only trick children have up their sleeve which machine learning hopes to copy. Experiments from Professor Gopnik’s lab show that children have well-developed Bayesian reasoning abilities. Bayes’ theorem is all about assimilating new information into your assessment of what is likely to be true based on your prior knowledge. For example, finding an unfamiliar pair of underwear in your partner’s car might be a worrying sign—but if you know that they work in dry-cleaning and use the car to transport lost clothes, you might be less concerned.
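As a toy illustration of that kind of updating, here is the calculation in Python (the probabilities are invented purely for the example):
# Toy Bayesian update: how likely is hypothesis H ("something is wrong")
# given evidence E ("unfamiliar clothes in the car")?
prior_h = 0.05          # P(H): belief before seeing the evidence
p_e_given_h = 0.50      # P(E | H): chance of the evidence if H is true
p_e_given_not_h = 0.20  # P(E | not H): they transport lost dry-cleaning anyway

p_e = p_e_given_h * prior_h + p_e_given_not_h * (1 - prior_h)
posterior_h = p_e_given_h * prior_h / p_e  # Bayes' theorem
print(round(posterior_h, 3))  # about 0.12: higher than the prior, but still low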
Scientists at Berkeley present children with logical puzzles, such as machines that can be activated by placing different types of blocks or complicated toys that require a certain sequence of actions to light up and make music.
When they are given several examples (such as a small dataset of demonstrations of the toy), they can often infer the rules behind how the new system works from the age of three or four. These are Bayesian problems: the children efficiently assimilate the new information to help them understand the universal rules behind the toys. When the system isn’t explained, the children’s inherent curiosity leads them to experimenting with these systems—testing different combinations of actions and blocks—to quickly infer the rules behind how they work.
Indeed, it’s the curiosity of children that actually allows them to outperform adults in certain circumstances. When an incentive structure is introduced—i.e. “points” that can be gained and lost depending on your actions—adults tend to become conservative and risk-averse. Children are more concerned with understanding how the system works, and hence deploy riskier strategies. Curiosity may kill the cat, but in the right situation, it can allow children to win the game by identifying rules that adults miss because they avoid any action that might result in punishment.
To Explore or to Exploit?
This research shows not only the innate intelligence of children, but also touches on classic problems in algorithm design. The explore-exploit problem is well known in machine learning. Put simply, if you only have a certain amount of resources (time, computational ability, etc.), are you better off searching for new strategies, or simply taking the path that seems to most obviously lead to gains?
Children favor exploration over exploitation. This is how they learn—through play and experimentation with their surroundings, through keen observation and asking as many questions as they can. Children are social learners: as well as interacting with their environment, they learn from others. Anyone who has ever had to deal with a toddler endlessly using that favorite word, “why?”, will recognize this as a feature of how children learn! As we get older—kicking in around adolescence in Gopnik’s experiments—we switch to exploiting the strategies we’ve already learned rather than taking those risks.
These concepts are already being imitated in machine learning algorithms. One example is the idea of “temperature” for algorithms that look through possible solutions to a problem to find the best one. A high-temperature search is more likely to pick a random move that might initially take you further away from the reward. This means that the optimization is less likely to get “stuck” on a particular solution that’s hard to improve upon, but may not be the best out there—but it’s also slower to find a solution. Meanwhile, searches with lower temperature take fewer “risky” random moves and instead seek to refine what’s already been found.
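A minimal sketch of how temperature controls that trade-off, using softmax sampling over a handful of candidate moves (the scores are arbitrary):
import numpy as np

def sample_move(scores, temperature):
    # Softmax with temperature: a high temperature flattens the distribution
    # (more exploration), a low temperature concentrates it on the best-scoring
    # move (more exploitation)
    scores = np.asarray(scores, dtype=float)
    probs = np.exp(scores / temperature)
    probs /= probs.sum()
    return np.random.choice(len(scores), p=probs)

move_scores = [1.0, 2.0, 5.0]                     # arbitrary scores for three candidate moves
print(sample_move(move_scores, temperature=5.0))  # often picks a lower-scoring move
print(sample_move(move_scores, temperature=0.1))  # almost always picks the best move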
In many ways, humans develop in the same way, from high-temperature toddlers who bounce around playing with new ideas and new solutions even when they seem strange to low-temperature adults who take fewer risks, are more methodical, but also less creative. This is how we try to program our machine learning algorithms to behave as well.
It’s nearly 70 years since Turing first suggested that we could create a general intelligence by simulating the mind of a child. The children he looked to for inspiration in 1950 are all knocking on the door of old age today. Yet, for all that machine learning and child psychology have developed over the years, there’s still a great deal that we don’t understand about how children can be such flexible, adaptive, and effective learners.
Understanding the learning process and the minds of children may help us to build better algorithms, but it could also help us to teach and nurture better and happier humans. Ultimately, isn’t that what technological progress is supposed to be about?
0 notes
Text
Using Python to recover SEO site traffic (Part three) Search Engine Watch
When you incorporate machine learning techniques to speed up SEO recovery, the results can be amazing.
This is the third and last installment from our series on using Python to speed SEO traffic recovery. In part one, I explained how our unique approach, that we call “winners vs losers” helps us quickly narrow down the pages losing traffic to find the main reason for the drop. In part two, we improved on our initial approach to manually group pages using regular expressions, which is very useful when you have sites with thousands or millions of pages, which is typically the case with ecommerce sites. In part three, we will learn something really exciting. We will learn to automatically group pages using machine learning.
As mentioned before, you can find the code used in part one, two and three in this Google Colab notebook.
Let’s get started.
URL matching vs content matching
When we grouped pages manually in part two, we benefited from the fact that the URL groups had clear patterns (collections, products, and so on), but it is often the case that there are no patterns in the URL. For example, Yahoo Stores’ sites use a flat URL structure with no directory paths. Our manual approach wouldn’t work in this case.
Fortunately, it is possible to group pages by their content because most page templates have different content structures. They serve different user needs, so their structures need to differ.
How can we organize pages by their content? We can use DOM element selectors for this. We will specifically use XPaths.
For example, I can use the presence of a big product image to know the page is a product detail page. I can grab the XPath of the product image by right-clicking on it in Chrome and choosing “Inspect,” then right-clicking the highlighted element to copy its XPath.
We can identify other page groups by finding page elements that are unique to them. However, note that while this would allow us to group Yahoo Store-type sites, it would still be a manual process to create the groups.
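As an illustration of this kind of XPath check (not code from the original notebook), a small sketch with requests and lxml might look like the following; the URL and XPath are placeholders.

```python
import requests
from lxml import html

# Hypothetical XPath copied from Chrome DevTools for the main product image.
PRODUCT_IMAGE_XPATH = '//*[@id="main-product-image"]'

def looks_like_product_page(url):
    """Return True if the page contains the element the product-page XPath points to."""
    response = requests.get(url, timeout=10)
    tree = html.fromstring(response.content)
    return len(tree.xpath(PRODUCT_IMAGE_XPATH)) > 0

print(looks_like_product_page("https://example.com/some-product"))  # placeholder URL
```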
A scientist’s bottom-up approach
In order to group pages automatically, we need to use a statistical approach. In other words, we need to find patterns in the data that we can use to cluster similar pages together because they share similar statistics. This is a perfect problem for machine learning algorithms.
BloomReach, a digital experience platform vendor, shared their machine learning solution to this problem. To summarize it, they first manually selected clean features from the HTML tags, such as class IDs, CSS style sheet names, and so on. Then, they automatically grouped pages based on the presence and variability of these features. In their tests, they achieved around 90% accuracy, which is pretty good.
When you give problems like this to scientists and engineers with no domain expertise, they will generally come up with complicated, bottom-up solutions. The scientist will say, “Here is the data I have, let me try different computer science ideas I know until I find a good solution.”
One of the reasons I advocate practitioners learn programming is that you can start solving problems using your domain expertise and find shortcuts like the one I will share next.
Hamlet’s observation and a simpler solution
For most ecommerce sites, most page templates include images (and input elements), and those generally change in quantity and size.
I decided to test the quantity and size of images, and the number of input elements, as my feature set. We were able to achieve 97.5% accuracy in our tests. This is a much simpler and more effective approach for this specific problem. All of this is possible because I didn’t start with the data I could access, but with a simpler domain-level observation.
I am not trying to say my approach is superior, as they have tested theirs on millions of pages and I’ve only tested this on a few thousand. My point is that as a practitioner you should learn this stuff so you can contribute your own expertise and creativity.
Now let’s get to the fun part and write some machine learning code in Python!
Collecting training data
We need training data to build a model. This training data needs to come pre-labeled with “correct” answers so that the model can learn from the correct answers and make its own predictions on unseen data.
In our case, as discussed above, we’ll use our intuition that most product pages have one or more large images on the page, and most category type pages have many smaller images on the page.
What’s more, product pages typically have more form elements than category pages (for filling in quantity, color, and more).
Unfortunately, crawling a web page for this data requires knowledge of web browser automation and image manipulation, which are outside the scope of this post. Feel free to study this GitHub gist we put together to learn more.
Here we load the raw data already collected.
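The exact file names aren’t shown in this copy of the post, so the snippet below is only a hedged sketch of the loading step, assuming two CSV exports named form_counts.csv and img_counts.csv that match the data frames described next.

```python
import pandas as pd

# Assumed file names; the original notebook may load these from Google Drive or a gist instead.
form_counts = pd.read_csv("form_counts.csv")  # one row per URL: form and input element counts
img_counts = pd.read_csv("img_counts.csv")    # one row per image: url, size, width, height

print(form_counts.head())
print(img_counts.head())
```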
Feature engineering
Each row of the form_counts data frame above corresponds to a single URL and provides a count of both the form elements and the input elements contained on that page.
Meanwhile, in the img_counts data frame, each row corresponds to a single image from a particular page. Each image has an associated file size, height, and width. Pages are more than likely to have multiple images, so there are many rows corresponding to each URL.
It is often the case that HTML documents don’t include explicit image dimensions. We are using a little trick to compensate for this. We are capturing the size of the image files, which should be roughly proportional to the product of the width and height of the images.
We want our image counts and image file sizes to be treated as categorical features, not numerical ones. When a numerical feature, say new visitors, increases, it generally implies improvement, but we don’t want bigger images to imply improvement. A common technique for this is called one-hot encoding.
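In pandas terms, one-hot encoding turns a categorical column into a set of 0/1 indicator columns; a toy illustration (not the notebook’s code):

```python
import pandas as pd

# Each distinct category becomes its own indicator column.
toy = pd.DataFrame({"img_size_bucket": ["small", "small", "large", "medium"]})
print(pd.get_dummies(toy, columns=["img_size_bucket"]))
```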
Most site pages can have an arbitrary number of images. We are going to further process our dataset by bucketing images into 50 groups. This technique is called “binning”.
Here is what our processed data set looks like.
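The processed data set itself isn’t reproduced in this scrape, so the following is a hedged sketch of what the binning and encoding might look like, continuing the loading sketch above; the column names (url, size) and the join with form_counts are assumptions.

```python
import pandas as pd

# Aggregate per-URL image statistics, bin them into 50 groups, then one-hot encode the bins.
per_url = img_counts.groupby("url").agg(
    img_count=("size", "count"),
    total_img_size=("size", "sum"),
)

per_url["img_count_bin"] = pd.cut(per_url["img_count"], bins=50, labels=False)
per_url["img_size_bin"] = pd.cut(per_url["total_img_size"], bins=50, labels=False)

features = pd.get_dummies(per_url[["img_count_bin", "img_size_bin"]].astype("category"))
features = features.join(form_counts.set_index("url"))  # add the form/input counts per URL
print(features.head())
```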
Adding ground truth labels
As we already have correct labels from our manual regex approach, we can use them as the ground truth to feed the model.
We also need to split our dataset randomly into a training set and a test set. This allows us to train the machine learning model on one set of data and test it on another set that it’s never seen before. We do this to prevent our model from simply “memorizing” the training data and doing terribly on new, unseen data. You can check it out at the link given below:
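The linked code isn’t embedded in this copy of the post, so as a stand-in, here is a hedged sketch of the labeling and split, assuming a labels data frame (url, page_type) produced by the part-two regex work and the features frame from the sketch above.

```python
from sklearn.model_selection import train_test_split

# Assumed: `labels` maps each URL to a page_type such as "product", "category", or "other".
dataset = features.join(labels.set_index("url"))

X = dataset.drop(columns=["page_type"])
y = dataset["page_type"]

# Hold out 30% of the pages so the model is evaluated on URLs it has never seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
```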
Model training and grid search
Finally, the good stuff!
All the steps above, the data collection and preparation, are generally the hardest part to code. The machine learning code itself is usually quite simple.
We’re using the well-known Scikit-learn Python library to train a number of popular models with a range of standard hyperparameters (settings for fine-tuning a model). Scikit-learn will run through all of them to find the best one. We simply need to feed the X variables (our engineered features above) and the Y variables (the correct labels) to each model, call the .fit() function, and voila!
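The specific models and hyperparameter grids aren’t listed in the text, so this is only a sketch of that kind of grid search over two of the candidates mentioned below; the parameter values are assumptions.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Illustrative candidates and grids, not the article's exact setup.
candidates = {
    "linear_svm": (LinearSVC(max_iter=5000), {"C": [0.1, 1, 10]}),
    "logistic_regression": (LogisticRegression(max_iter=5000), {"C": [0.1, 1, 10]}),
}

searches = {}
for name, (model, grid) in candidates.items():
    search = GridSearchCV(model, grid, cv=5, scoring="accuracy")
    search.fit(X_train, y_train)
    searches[name] = search
    print(name, search.best_score_, search.best_params_)
```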
Evaluating performance
After running the grid search, we find our winning model to be the linear SVM (0.974), with logistic regression (0.968) coming in a close second. Even with such high accuracy, a machine learning model will still make mistakes. If it doesn’t make any mistakes, then there is definitely something wrong with the code.
In order to understand where the model performs best and worst, we will use another useful machine learning tool, the confusion matrix.
When looking at a confusion matrix, focus on the diagonal squares. The counts there are correct predictions, and the counts outside the diagonal are failures. In the confusion matrix above we can quickly see that the model does really well labeling products, but terribly labeling pages that are neither products nor categories. Intuitively, we can assume that such pages would not have consistent image usage.
Here is the code to put together the confusion matrix:
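The original code block isn’t included in this copy, so here is a stand-in sketch using scikit-learn’s confusion_matrix on the winning model from the grid search sketch above:

```python
from sklearn.metrics import confusion_matrix

best_model = searches["linear_svm"].best_estimator_  # assumed winner, per the scores above
y_pred = best_model.predict(X_test)

label_order = sorted(y.unique())
cm = confusion_matrix(y_test, y_pred, labels=label_order)
print(label_order)
print(cm)
```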
Finally, here is the code to plot the model evaluation:
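Again as a stand-in for the missing block, a heatmap of that matrix with matplotlib and seaborn might look like this:

```python
import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=label_order, yticklabels=label_order, ax=ax)
ax.set_xlabel("Predicted page group")
ax.set_ylabel("Actual page group")
plt.tight_layout()
plt.show()
```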
Resources to learn more
You might be thinking that this is a lot of work just to tell page groups apart, and you are right!
Mirko Obkircher commented in my article for part two that there is a much simpler approach, which is to have your client set up a Google Analytics data layer with the page group type. Very smart recommendation, Mirko!
I am using this example for illustration purposes. What if the issue requires a deeper exploratory investigation? If you already started the analysis using Python, your creativity and knowledge are the only limits.
If you want to jump onto the machine learning bandwagon, here are some resources I recommend to learn more:
Attend a PyData event. I got motivated to learn data science after attending the event they host in New York.
Hands-On Introduction To Scikit-learn (sklearn)
Scikit Learn Cheat Sheet
Efficiently Searching Optimal Tuning Parameters
If you are starting from scratch and want to learn fast, I’ve heard good things about Data Camp.
Got any tips or queries? Share them in the comments.
Hamlet Batista is the CEO and founder of RankSense, an agile SEO platform for online retailers and manufacturers. He can be found on Twitter @hamletbatista.
0 notes
alanajacksontx · 5 years
Text
Using Python to recover SEO site traffic (Part three)
When you incorporate machine learning techniques to speed up SEO recovery, the results can be amazing.
This is the third and last installment from our series on using Python to speed SEO traffic recovery. In part one, I explained how our unique approach, that we call “winners vs losers” helps us quickly narrow down the pages losing traffic to find the main reason for the drop. In part two, we improved on our initial approach to manually group pages using regular expressions, which is very useful when you have sites with thousands or millions of pages, which is typically the case with ecommerce sites. In part three, we will learn something really exciting. We will learn to automatically group pages using machine learning.
As mentioned before, you can find the code used in part one, two and three in this Google Colab notebook.
Let’s get started.
URL matching vs content matching
When we grouped pages manually in part two, we benefited from the fact the URLs groups had clear patterns (collections, products, and the others) but it is often the case where there are no patterns in the URL. For example, Yahoo Stores’ sites use a flat URL structure with no directory paths. Our manual approach wouldn’t work in this case.
Fortunately, it is possible to group pages by their contents because most page templates have different content structures. They serve different user needs, so that needs to be the case.
How can we organize pages by their content? We can use DOM element selectors for this. We will specifically use XPaths.
For example, I can use the presence of a big product image to know the page is a product detail page. I can grab the product image address in the document (its XPath) by right-clicking on it in Chrome and choosing “Inspect,” then right-clicking to copy the XPath.
We can identify other page groups by finding page elements that are unique to them. However, note that while this would allow us to group Yahoo Store-type sites, it would still be a manual process to create the groups.
A scientist’s bottom-up approach
In order to group pages automatically, we need to use a statistical approach. In other words, we need to find patterns in the data that we can use to cluster similar pages together because they share similar statistics. This is a perfect problem for machine learning algorithms.
BloomReach, a digital experience platform vendor, shared their machine learning solution to this problem. To summarize it, they first manually selected cleaned features from the HTML tags like class IDs, CSS style sheet names, and the others. Then, they automatically grouped pages based on the presence and variability of these features. In their tests, they achieved around 90% accuracy, which is pretty good.
When you give problems like this to scientists and engineers with no domain expertise, they will generally come up with complicated, bottom-up solutions. The scientist will say, “Here is the data I have, let me try different computer science ideas I know until I find a good solution.”
One of the reasons I advocate practitioners learn programming is that you can start solving problems using your domain expertise and find shortcuts like the one I will share next.
Hamlet’s observation and a simpler solution
For most ecommerce sites, most page templates include images (and input elements), and those generally change in quantity and size.
I decided to test the quantity and size of images, and the number of input elements as my features set. We were able to achieve 97.5% accuracy in our tests. This is a much simpler and effective approach for this specific problem. All of this is possible because I didn’t start with the data I could access, but with a simpler domain-level observation.
I am not trying to say my approach is superior, as they have tested theirs in millions of pages and I’ve only tested this on a few thousand. My point is that as a practitioner you should learn this stuff so you can contribute your own expertise and creativity.
Now let’s get to the fun part and get to code some machine learning code in Python!
Collecting training data
We need training data to build a model. This training data needs to come pre-labeled with “correct” answers so that the model can learn from the correct answers and make its own predictions on unseen data.
In our case, as discussed above, we’ll use our intuition that most product pages have one or more large images on the page, and most category type pages have many smaller images on the page.
What’s more, product pages typically have more form elements than category pages (for filling in quantity, color, and more).
Unfortunately, crawling a web page for this data requires knowledge of web browser automation, and image manipulation, which are outside the scope of this post. Feel free to study this GitHub gist we put together to learn more.
Here we load the raw data already collected.
Feature engineering
Each row of the form_counts data frame above corresponds to a single URL and provides a count of both form elements, and input elements contained on that page.
Meanwhile, in the img_counts data frame, each row corresponds to a single image from a particular page. Each image has an associated file size, height, and width. Pages are more than likely to have multiple images on each page, and so there are many rows corresponding to each URL.
It is often the case that HTML documents don’t include explicit image dimensions. We are using a little trick to compensate for this. We are capturing the size of the image files, which would be proportional to the multiplication of the width and the length of the images.
We want our image counts and image file sizes to be treated as categorical features, not numerical ones. When a numerical feature, say new visitors, increases it generally implies improvement, but we don’t want bigger images to imply improvement. A common technique to do this is called one-hot encoding.
Most site pages can have an arbitrary number of images. We are going to further process our dataset by bucketing images into 50 groups. This technique is called “binning”.
Here is what our processed data set looks like.
Adding ground truth labels
As we already have correct labels from our manual regex approach, we can use them to create the correct labels to feed the model.
We also need to split our dataset randomly into a training set and a test set. This allows us to train the machine learning model on one set of data, and test it on another set that it’s never seen before. We do this to prevent our model from simply “memorizing” the training data and doing terribly on new, unseen data. You can check it out at the link given below:
Model training and grid search
Finally, the good stuff!
All the steps above, the data collection and preparation, are generally the hardest part to code. The machine learning code is generally quite simple.
We’re using the well-known Scikitlearn python library to train a number of popular models using a bunch of standard hyperparameters (settings for fine-tuning a model). Scikitlearn will run through all of them to find the best one, we simply need to feed in the X variables (our feature engineering parameters above) and the Y variables (the correct labels) to each model, and perform the .fit() function and voila!
Evaluating performance
After running the grid search, we find our winning model to be the Linear SVM (0.974) and Logistic regression (0.968) coming at a close second. Even with such high accuracy, a machine learning model will make mistakes. If it doesn’t make any mistakes, then there is definitely something wrong with the code.
In order to understand where the model performs best and worst, we will use another useful machine learning tool, the confusion matrix.
When looking at a confusion matrix, focus on the diagonal squares. The counts there are correct predictions and the counts outside are failures. In the confusion matrix above we can quickly see that the model does really well-labeling products, but terribly labeling pages that are not product or categories. Intuitively, we can assume that such pages would not have consistent image usage.
Here is the code to put together the confusion matrix:
Finally, here is the code to plot the model evaluation:
Resources to learn more
You might be thinking that this is a lot of work to just tell page groups, and you are right!
Mirko Obkircher commented in my article for part two that there is a much simpler approach, which is to have your client set up a Google Analytics data layer with the page group type. Very smart recommendation, Mirko!
I am using this example for illustration purposes. What if the issue requires a deeper exploratory investigation? If you already started the analysis using Python, your creativity and knowledge are the only limits.
If you want to jump onto the machine learning bandwagon, here are some resources I recommend to learn more:
Attend a Pydata event I got motivated to learn data science after attending the event they host in New York.
Hands-On Introduction To Scikit-learn (sklearn)
Scikit Learn Cheat Sheet
Efficiently Searching Optimal Tuning Parameters
If you are starting from scratch and want to learn fast, I’ve heard good things about Data Camp.
Got any tips or queries? Share it in the comments.
Hamlet Batista is the CEO and founder of RankSense, an agile SEO platform for online retailers and manufacturers. He can be found on Twitter @hamletbatista.
The post Using Python to recover SEO site traffic (Part three) appeared first on Search Engine Watch.
from IM Tips And Tricks https://searchenginewatch.com/2019/04/17/using-python-to-recover-seo-site-traffic-part-three/ from Rising Phoenix SEO https://risingphxseo.tumblr.com/post/184297809275
0 notes
kellykperez · 5 years
Text
Using Python to recover SEO site traffic (Part three)
When you incorporate machine learning techniques to speed up SEO recovery, the results can be amazing.
This is the third and last installment from our series on using Python to speed SEO traffic recovery. In part one, I explained how our unique approach, that we call “winners vs losers” helps us quickly narrow down the pages losing traffic to find the main reason for the drop. In part two, we improved on our initial approach to manually group pages using regular expressions, which is very useful when you have sites with thousands or millions of pages, which is typically the case with ecommerce sites. In part three, we will learn something really exciting. We will learn to automatically group pages using machine learning.
As mentioned before, you can find the code used in part one, two and three in this Google Colab notebook.
Let’s get started.
URL matching vs content matching
When we grouped pages manually in part two, we benefited from the fact the URLs groups had clear patterns (collections, products, and the others) but it is often the case where there are no patterns in the URL. For example, Yahoo Stores’ sites use a flat URL structure with no directory paths. Our manual approach wouldn’t work in this case.
Fortunately, it is possible to group pages by their contents because most page templates have different content structures. They serve different user needs, so that needs to be the case.
How can we organize pages by their content? We can use DOM element selectors for this. We will specifically use XPaths.
For example, I can use the presence of a big product image to know the page is a product detail page. I can grab the product image address in the document (its XPath) by right-clicking on it in Chrome and choosing “Inspect,” then right-clicking to copy the XPath.
We can identify other page groups by finding page elements that are unique to them. However, note that while this would allow us to group Yahoo Store-type sites, it would still be a manual process to create the groups.
A scientist’s bottom-up approach
In order to group pages automatically, we need to use a statistical approach. In other words, we need to find patterns in the data that we can use to cluster similar pages together because they share similar statistics. This is a perfect problem for machine learning algorithms.
BloomReach, a digital experience platform vendor, shared their machine learning solution to this problem. To summarize it, they first manually selected cleaned features from the HTML tags like class IDs, CSS style sheet names, and the others. Then, they automatically grouped pages based on the presence and variability of these features. In their tests, they achieved around 90% accuracy, which is pretty good.
When you give problems like this to scientists and engineers with no domain expertise, they will generally come up with complicated, bottom-up solutions. The scientist will say, “Here is the data I have, let me try different computer science ideas I know until I find a good solution.”
One of the reasons I advocate practitioners learn programming is that you can start solving problems using your domain expertise and find shortcuts like the one I will share next.
Hamlet’s observation and a simpler solution
For most ecommerce sites, most page templates include images (and input elements), and those generally change in quantity and size.
I decided to test the quantity and size of images, and the number of input elements as my features set. We were able to achieve 97.5% accuracy in our tests. This is a much simpler and effective approach for this specific problem. All of this is possible because I didn’t start with the data I could access, but with a simpler domain-level observation.
I am not trying to say my approach is superior, as they have tested theirs in millions of pages and I’ve only tested this on a few thousand. My point is that as a practitioner you should learn this stuff so you can contribute your own expertise and creativity.
Now let’s get to the fun part and get to code some machine learning code in Python!
Collecting training data
We need training data to build a model. This training data needs to come pre-labeled with “correct” answers so that the model can learn from the correct answers and make its own predictions on unseen data.
In our case, as discussed above, we’ll use our intuition that most product pages have one or more large images on the page, and most category type pages have many smaller images on the page.
What’s more, product pages typically have more form elements than category pages (for filling in quantity, color, and more).
Unfortunately, crawling a web page for this data requires knowledge of web browser automation, and image manipulation, which are outside the scope of this post. Feel free to study this GitHub gist we put together to learn more.
Here we load the raw data already collected.
Feature engineering
Each row of the form_counts data frame above corresponds to a single URL and provides a count of both form elements, and input elements contained on that page.
Meanwhile, in the img_counts data frame, each row corresponds to a single image from a particular page. Each image has an associated file size, height, and width. Pages are more than likely to have multiple images on each page, and so there are many rows corresponding to each URL.
It is often the case that HTML documents don’t include explicit image dimensions. We are using a little trick to compensate for this. We are capturing the size of the image files, which would be proportional to the multiplication of the width and the length of the images.
We want our image counts and image file sizes to be treated as categorical features, not numerical ones. When a numerical feature, say new visitors, increases it generally implies improvement, but we don’t want bigger images to imply improvement. A common technique to do this is called one-hot encoding.
Most site pages can have an arbitrary number of images. We are going to further process our dataset by bucketing images into 50 groups. This technique is called “binning”.
Here is what our processed data set looks like.
Adding ground truth labels
As we already have correct labels from our manual regex approach, we can use them to create the correct labels to feed the model.
We also need to split our dataset randomly into a training set and a test set. This allows us to train the machine learning model on one set of data, and test it on another set that it’s never seen before. We do this to prevent our model from simply “memorizing” the training data and doing terribly on new, unseen data. You can check it out at the link given below:
Model training and grid search
Finally, the good stuff!
All the steps above, the data collection and preparation, are generally the hardest part to code. The machine learning code is generally quite simple.
We’re using the well-known Scikitlearn python library to train a number of popular models using a bunch of standard hyperparameters (settings for fine-tuning a model). Scikitlearn will run through all of them to find the best one, we simply need to feed in the X variables (our feature engineering parameters above) and the Y variables (the correct labels) to each model, and perform the .fit() function and voila!
Evaluating performance
After running the grid search, we find our winning model to be the Linear SVM (0.974) and Logistic regression (0.968) coming at a close second. Even with such high accuracy, a machine learning model will make mistakes. If it doesn’t make any mistakes, then there is definitely something wrong with the code.
In order to understand where the model performs best and worst, we will use another useful machine learning tool, the confusion matrix.
When looking at a confusion matrix, focus on the diagonal squares. The counts there are correct predictions and the counts outside are failures. In the confusion matrix above we can quickly see that the model does really well-labeling products, but terribly labeling pages that are not product or categories. Intuitively, we can assume that such pages would not have consistent image usage.
Here is the code to put together the confusion matrix:
Finally, here is the code to plot the model evaluation:
Resources to learn more
You might be thinking that this is a lot of work to just tell page groups, and you are right!
Mirko Obkircher commented in my article for part two that there is a much simpler approach, which is to have your client set up a Google Analytics data layer with the page group type. Very smart recommendation, Mirko!
I am using this example for illustration purposes. What if the issue requires a deeper exploratory investigation? If you already started the analysis using Python, your creativity and knowledge are the only limits.
If you want to jump onto the machine learning bandwagon, here are some resources I recommend to learn more:
Attend a Pydata event I got motivated to learn data science after attending the event they host in New York.
Hands-On Introduction To Scikit-learn (sklearn)
Scikit Learn Cheat Sheet
Efficiently Searching Optimal Tuning Parameters
If you are starting from scratch and want to learn fast, I’ve heard good things about Data Camp.
Got any tips or queries? Share it in the comments.
Hamlet Batista is the CEO and founder of RankSense, an agile SEO platform for online retailers and manufacturers. He can be found on Twitter @hamletbatista.
The post Using Python to recover SEO site traffic (Part three) appeared first on Search Engine Watch.
source https://searchenginewatch.com/2019/04/17/using-python-to-recover-seo-site-traffic-part-three/ from Rising Phoenix SEO http://risingphoenixseo.blogspot.com/2019/04/using-python-to-recover-seo-site.html
0 notes
bambiguertinus · 5 years
Text
Using Python to recover SEO site traffic (Part three)
When you incorporate machine learning techniques to speed up SEO recovery, the results can be amazing.
This is the third and last installment from our series on using Python to speed SEO traffic recovery. In part one, I explained how our unique approach, that we call “winners vs losers” helps us quickly narrow down the pages losing traffic to find the main reason for the drop. In part two, we improved on our initial approach to manually group pages using regular expressions, which is very useful when you have sites with thousands or millions of pages, which is typically the case with ecommerce sites. In part three, we will learn something really exciting. We will learn to automatically group pages using machine learning.
As mentioned before, you can find the code used in part one, two and three in this Google Colab notebook.
Let’s get started.
URL matching vs content matching
When we grouped pages manually in part two, we benefited from the fact the URLs groups had clear patterns (collections, products, and the others) but it is often the case where there are no patterns in the URL. For example, Yahoo Stores’ sites use a flat URL structure with no directory paths. Our manual approach wouldn’t work in this case.
Fortunately, it is possible to group pages by their contents because most page templates have different content structures. They serve different user needs, so that needs to be the case.
How can we organize pages by their content? We can use DOM element selectors for this. We will specifically use XPaths.
For example, I can use the presence of a big product image to know the page is a product detail page. I can grab the product image address in the document (its XPath) by right-clicking on it in Chrome and choosing “Inspect,” then right-clicking to copy the XPath.
We can identify other page groups by finding page elements that are unique to them. However, note that while this would allow us to group Yahoo Store-type sites, it would still be a manual process to create the groups.
A scientist’s bottom-up approach
In order to group pages automatically, we need to use a statistical approach. In other words, we need to find patterns in the data that we can use to cluster similar pages together because they share similar statistics. This is a perfect problem for machine learning algorithms.
BloomReach, a digital experience platform vendor, shared their machine learning solution to this problem. To summarize it, they first manually selected cleaned features from the HTML tags like class IDs, CSS style sheet names, and the others. Then, they automatically grouped pages based on the presence and variability of these features. In their tests, they achieved around 90% accuracy, which is pretty good.
When you give problems like this to scientists and engineers with no domain expertise, they will generally come up with complicated, bottom-up solutions. The scientist will say, “Here is the data I have, let me try different computer science ideas I know until I find a good solution.”
One of the reasons I advocate practitioners learn programming is that you can start solving problems using your domain expertise and find shortcuts like the one I will share next.
Hamlet’s observation and a simpler solution
For most ecommerce sites, most page templates include images (and input elements), and those generally change in quantity and size.
I decided to test the quantity and size of images, and the number of input elements as my features set. We were able to achieve 97.5% accuracy in our tests. This is a much simpler and effective approach for this specific problem. All of this is possible because I didn’t start with the data I could access, but with a simpler domain-level observation.
I am not trying to say my approach is superior, as they have tested theirs in millions of pages and I’ve only tested this on a few thousand. My point is that as a practitioner you should learn this stuff so you can contribute your own expertise and creativity.
Now let’s get to the fun part and get to code some machine learning code in Python!
Collecting training data
We need training data to build a model. This training data needs to come pre-labeled with “correct” answers so that the model can learn from the correct answers and make its own predictions on unseen data.
In our case, as discussed above, we’ll use our intuition that most product pages have one or more large images on the page, and most category type pages have many smaller images on the page.
What’s more, product pages typically have more form elements than category pages (for filling in quantity, color, and more).
Unfortunately, crawling a web page for this data requires knowledge of web browser automation, and image manipulation, which are outside the scope of this post. Feel free to study this GitHub gist we put together to learn more.
Here we load the raw data already collected.
Feature engineering
Each row of the form_counts data frame above corresponds to a single URL and provides a count of both form elements, and input elements contained on that page.
Meanwhile, in the img_counts data frame, each row corresponds to a single image from a particular page. Each image has an associated file size, height, and width. Pages are more than likely to have multiple images on each page, and so there are many rows corresponding to each URL.
It is often the case that HTML documents don’t include explicit image dimensions. We are using a little trick to compensate for this. We are capturing the size of the image files, which would be proportional to the multiplication of the width and the length of the images.
We want our image counts and image file sizes to be treated as categorical features, not numerical ones. When a numerical feature, say new visitors, increases it generally implies improvement, but we don’t want bigger images to imply improvement. A common technique to do this is called one-hot encoding.
Most site pages can have an arbitrary number of images. We are going to further process our dataset by bucketing images into 50 groups. This technique is called “binning”.
Here is what our processed data set looks like.
Adding ground truth labels
As we already have correct labels from our manual regex approach, we can use them to create the correct labels to feed the model.
We also need to split our dataset randomly into a training set and a test set. This allows us to train the machine learning model on one set of data, and test it on another set that it’s never seen before. We do this to prevent our model from simply “memorizing” the training data and doing terribly on new, unseen data. You can check it out at the link given below:
Model training and grid search
Finally, the good stuff!
All the steps above, the data collection and preparation, are generally the hardest part to code. The machine learning code is generally quite simple.
We’re using the well-known Scikitlearn python library to train a number of popular models using a bunch of standard hyperparameters (settings for fine-tuning a model). Scikitlearn will run through all of them to find the best one, we simply need to feed in the X variables (our feature engineering parameters above) and the Y variables (the correct labels) to each model, and perform the .fit() function and voila!
Evaluating performance
After running the grid search, we find our winning model to be the Linear SVM (0.974) and Logistic regression (0.968) coming at a close second. Even with such high accuracy, a machine learning model will make mistakes. If it doesn’t make any mistakes, then there is definitely something wrong with the code.
In order to understand where the model performs best and worst, we will use another useful machine learning tool, the confusion matrix.
When looking at a confusion matrix, focus on the diagonal squares. The counts there are correct predictions and the counts outside are failures. In the confusion matrix above we can quickly see that the model does really well-labeling products, but terribly labeling pages that are not product or categories. Intuitively, we can assume that such pages would not have consistent image usage.
Here is the code to put together the confusion matrix:
Finally, here is the code to plot the model evaluation:
Resources to learn more
You might be thinking that this is a lot of work to just tell page groups, and you are right!
Mirko Obkircher commented in my article for part two that there is a much simpler approach, which is to have your client set up a Google Analytics data layer with the page group type. Very smart recommendation, Mirko!
I am using this example for illustration purposes. What if the issue requires a deeper exploratory investigation? If you already started the analysis using Python, your creativity and knowledge are the only limits.
If you want to jump onto the machine learning bandwagon, here are some resources I recommend to learn more:
Attend a Pydata event I got motivated to learn data science after attending the event they host in New York.
Hands-On Introduction To Scikit-learn (sklearn)
Scikit Learn Cheat Sheet
Efficiently Searching Optimal Tuning Parameters
If you are starting from scratch and want to learn fast, I’ve heard good things about Data Camp.
Got any tips or queries? Share it in the comments.
Hamlet Batista is the CEO and founder of RankSense, an agile SEO platform for online retailers and manufacturers. He can be found on Twitter @hamletbatista.
The post Using Python to recover SEO site traffic (Part three) appeared first on Search Engine Watch.
from Digtal Marketing News https://searchenginewatch.com/2019/04/17/using-python-to-recover-seo-site-traffic-part-three/
0 notes
evaaguilaus · 5 years
Text
Using Python to recover SEO site traffic (Part three)
When you incorporate machine learning techniques to speed up SEO recovery, the results can be amazing.
This is the third and last installment from our series on using Python to speed SEO traffic recovery. In part one, I explained how our unique approach, that we call “winners vs losers” helps us quickly narrow down the pages losing traffic to find the main reason for the drop. In part two, we improved on our initial approach to manually group pages using regular expressions, which is very useful when you have sites with thousands or millions of pages, which is typically the case with ecommerce sites. In part three, we will learn something really exciting. We will learn to automatically group pages using machine learning.
As mentioned before, you can find the code used in part one, two and three in this Google Colab notebook.
Let’s get started.
URL matching vs content matching
When we grouped pages manually in part two, we benefited from the fact the URLs groups had clear patterns (collections, products, and the others) but it is often the case where there are no patterns in the URL. For example, Yahoo Stores’ sites use a flat URL structure with no directory paths. Our manual approach wouldn’t work in this case.
Fortunately, it is possible to group pages by their contents because most page templates have different content structures. They serve different user needs, so that needs to be the case.
How can we organize pages by their content? We can use DOM element selectors for this. We will specifically use XPaths.
For example, I can use the presence of a big product image to know the page is a product detail page. I can grab the product image address in the document (its XPath) by right-clicking on it in Chrome and choosing “Inspect,” then right-clicking to copy the XPath.
We can identify other page groups by finding page elements that are unique to them. However, note that while this would allow us to group Yahoo Store-type sites, it would still be a manual process to create the groups.
A scientist’s bottom-up approach
In order to group pages automatically, we need to use a statistical approach. In other words, we need to find patterns in the data that we can use to cluster similar pages together because they share similar statistics. This is a perfect problem for machine learning algorithms.
BloomReach, a digital experience platform vendor, shared their machine learning solution to this problem. To summarize it, they first manually selected cleaned features from the HTML tags like class IDs, CSS style sheet names, and the others. Then, they automatically grouped pages based on the presence and variability of these features. In their tests, they achieved around 90% accuracy, which is pretty good.
When you give problems like this to scientists and engineers with no domain expertise, they will generally come up with complicated, bottom-up solutions. The scientist will say, “Here is the data I have, let me try different computer science ideas I know until I find a good solution.”
One of the reasons I advocate practitioners learn programming is that you can start solving problems using your domain expertise and find shortcuts like the one I will share next.
Hamlet’s observation and a simpler solution
For most ecommerce sites, most page templates include images (and input elements), and those generally change in quantity and size.
I decided to test the quantity and size of images, and the number of input elements as my features set. We were able to achieve 97.5% accuracy in our tests. This is a much simpler and effective approach for this specific problem. All of this is possible because I didn’t start with the data I could access, but with a simpler domain-level observation.
I am not trying to say my approach is superior, as they have tested theirs in millions of pages and I’ve only tested this on a few thousand. My point is that as a practitioner you should learn this stuff so you can contribute your own expertise and creativity.
Now let’s get to the fun part and get to code some machine learning code in Python!
Collecting training data
We need training data to build a model. This training data needs to come pre-labeled with “correct” answers so that the model can learn from the correct answers and make its own predictions on unseen data.
In our case, as discussed above, we’ll use our intuition that most product pages have one or more large images on the page, and most category type pages have many smaller images on the page.
What’s more, product pages typically have more form elements than category pages (for filling in quantity, color, and more).
Unfortunately, crawling a web page for this data requires knowledge of web browser automation, and image manipulation, which are outside the scope of this post. Feel free to study this GitHub gist we put together to learn more.
Here we load the raw data already collected.
Feature engineering
Each row of the form_counts data frame above corresponds to a single URL and provides a count of both form elements, and input elements contained on that page.
Meanwhile, in the img_counts data frame, each row corresponds to a single image from a particular page. Each image has an associated file size, height, and width. Pages are more than likely to have multiple images on each page, and so there are many rows corresponding to each URL.
It is often the case that HTML documents don’t include explicit image dimensions. We are using a little trick to compensate for this. We are capturing the size of the image files, which would be proportional to the multiplication of the width and the length of the images.
We want our image counts and image file sizes to be treated as categorical features, not numerical ones. When a numerical feature, say new visitors, increases it generally implies improvement, but we don’t want bigger images to imply improvement. A common technique to do this is called one-hot encoding.
Most site pages can have an arbitrary number of images. We are going to further process our dataset by bucketing images into 50 groups. This technique is called “binning”.
Here is what our processed data set looks like.
Adding ground truth labels
As we already have correct labels from our manual regex approach, we can use them to create the correct labels to feed the model.
We also need to split our dataset randomly into a training set and a test set. This allows us to train the machine learning model on one set of data, and test it on another set that it’s never seen before. We do this to prevent our model from simply “memorizing” the training data and doing terribly on new, unseen data. You can check it out at the link given below:
Model training and grid search
Finally, the good stuff!
All the steps above, the data collection and preparation, are generally the hardest part to code. The machine learning code is generally quite simple.
We’re using the well-known Scikitlearn python library to train a number of popular models using a bunch of standard hyperparameters (settings for fine-tuning a model). Scikitlearn will run through all of them to find the best one, we simply need to feed in the X variables (our feature engineering parameters above) and the Y variables (the correct labels) to each model, and perform the .fit() function and voila!
Evaluating performance
After running the grid search, we find our winning model to be the Linear SVM (0.974) and Logistic regression (0.968) coming at a close second. Even with such high accuracy, a machine learning model will make mistakes. If it doesn’t make any mistakes, then there is definitely something wrong with the code.
In order to understand where the model performs best and worst, we will use another useful machine learning tool, the confusion matrix.
When looking at a confusion matrix, focus on the diagonal squares. The counts there are correct predictions and the counts outside are failures. In the confusion matrix above we can quickly see that the model does really well-labeling products, but terribly labeling pages that are not product or categories. Intuitively, we can assume that such pages would not have consistent image usage.
Here is the code to put together the confusion matrix:
Finally, here is the code to plot the model evaluation:
Resources to learn more
You might be thinking that this is a lot of work to just tell page groups, and you are right!
Mirko Obkircher commented in my article for part two that there is a much simpler approach, which is to have your client set up a Google Analytics data layer with the page group type. Very smart recommendation, Mirko!
I am using this example for illustration purposes. What if the issue requires a deeper exploratory investigation? If you already started the analysis using Python, your creativity and knowledge are the only limits.
If you want to jump onto the machine learning bandwagon, here are some resources I recommend to learn more:
Attend a Pydata event I got motivated to learn data science after attending the event they host in New York.
Hands-On Introduction To Scikit-learn (sklearn)
Scikit Learn Cheat Sheet
Efficiently Searching Optimal Tuning Parameters
If you are starting from scratch and want to learn fast, I’ve heard good things about Data Camp.
Got any tips or queries? Share it in the comments.
Hamlet Batista is the CEO and founder of RankSense, an agile SEO platform for online retailers and manufacturers. He can be found on Twitter @hamletbatista.
The post Using Python to recover SEO site traffic (Part three) appeared first on Search Engine Watch.
from Digtal Marketing News https://searchenginewatch.com/2019/04/17/using-python-to-recover-seo-site-traffic-part-three/
0 notes
oscarkruegerus · 5 years
Text
Using Python to recover SEO site traffic (Part three)
When you incorporate machine learning techniques to speed up SEO recovery, the results can be amazing.
This is the third and last installment from our series on using Python to speed SEO traffic recovery. In part one, I explained how our unique approach, that we call “winners vs losers” helps us quickly narrow down the pages losing traffic to find the main reason for the drop. In part two, we improved on our initial approach to manually group pages using regular expressions, which is very useful when you have sites with thousands or millions of pages, which is typically the case with ecommerce sites. In part three, we will learn something really exciting. We will learn to automatically group pages using machine learning.
As mentioned before, you can find the code used in part one, two and three in this Google Colab notebook.
Let’s get started.
URL matching vs content matching
When we grouped pages manually in part two, we benefited from the fact the URLs groups had clear patterns (collections, products, and the others) but it is often the case where there are no patterns in the URL. For example, Yahoo Stores’ sites use a flat URL structure with no directory paths. Our manual approach wouldn’t work in this case.
Fortunately, it is possible to group pages by their contents because most page templates have different content structures. They serve different user needs, so that needs to be the case.
How can we organize pages by their content? We can use DOM element selectors for this. We will specifically use XPaths.
For example, I can use the presence of a big product image to know the page is a product detail page. I can grab the product image address in the document (its XPath) by right-clicking on it in Chrome and choosing “Inspect,” then right-clicking to copy the XPath.
We can identify other page groups by finding page elements that are unique to them. However, note that while this would allow us to group Yahoo Store-type sites, it would still be a manual process to create the groups.
A scientist’s bottom-up approach
In order to group pages automatically, we need to use a statistical approach. In other words, we need to find patterns in the data that we can use to cluster similar pages together because they share similar statistics. This is a perfect problem for machine learning algorithms.
BloomReach, a digital experience platform vendor, shared their machine learning solution to this problem. To summarize it, they first manually selected cleaned features from the HTML tags like class IDs, CSS style sheet names, and the others. Then, they automatically grouped pages based on the presence and variability of these features. In their tests, they achieved around 90% accuracy, which is pretty good.
When you give problems like this to scientists and engineers with no domain expertise, they will generally come up with complicated, bottom-up solutions. The scientist will say, “Here is the data I have, let me try different computer science ideas I know until I find a good solution.”
One of the reasons I advocate practitioners learn programming is that you can start solving problems using your domain expertise and find shortcuts like the one I will share next.
Hamlet’s observation and a simpler solution
For most ecommerce sites, most page templates include images (and input elements), and those generally change in quantity and size.
I decided to test the quantity and size of images, and the number of input elements as my features set. We were able to achieve 97.5% accuracy in our tests. This is a much simpler and effective approach for this specific problem. All of this is possible because I didn’t start with the data I could access, but with a simpler domain-level observation.
I am not trying to say my approach is superior, as they have tested theirs in millions of pages and I’ve only tested this on a few thousand. My point is that as a practitioner you should learn this stuff so you can contribute your own expertise and creativity.
Now let’s get to the fun part and get to code some machine learning code in Python!
Collecting training data
We need training data to build a model. This training data needs to come pre-labeled with “correct” answers so that the model can learn from the correct answers and make its own predictions on unseen data.
In our case, as discussed above, we’ll use our intuition that most product pages have one or more large images on the page, and most category type pages have many smaller images on the page.
What’s more, product pages typically have more form elements than category pages (for filling in quantity, color, and more).
Unfortunately, crawling a web page for this data requires knowledge of web browser automation, and image manipulation, which are outside the scope of this post. Feel free to study this GitHub gist we put together to learn more.
Here we load the raw data already collected.
Feature engineering
Each row of the form_counts data frame above corresponds to a single URL and provides a count of both form elements, and input elements contained on that page.
Meanwhile, in the img_counts data frame, each row corresponds to a single image from a particular page. Each image has an associated file size, height, and width. Pages are more than likely to have multiple images on each page, and so there are many rows corresponding to each URL.
It is often the case that HTML documents don’t include explicit image dimensions. We are using a little trick to compensate for this. We are capturing the size of the image files, which would be proportional to the multiplication of the width and the length of the images.
We want our image counts and image file sizes to be treated as categorical features, not numerical ones. When a numerical feature, say new visitors, increases it generally implies improvement, but we don’t want bigger images to imply improvement. A common technique to do this is called one-hot encoding.
Most site pages can have an arbitrary number of images. We are going to further process our dataset by bucketing images into 50 groups. This technique is called “binning”.
Here is what our processed data set looks like.
Adding ground truth labels
As we already have correct labels from our manual regex approach, we can use them to create the correct labels to feed the model.
We also need to split our dataset randomly into a training set and a test set. This allows us to train the machine learning model on one set of data, and test it on another set that it’s never seen before. We do this to prevent our model from simply “memorizing” the training data and doing terribly on new, unseen data. You can check it out at the link given below:
Model training and grid search
Finally, the good stuff!
All the steps above, the data collection and preparation, are generally the hardest part to code. The machine learning code is generally quite simple.
We’re using the well-known Scikitlearn python library to train a number of popular models using a bunch of standard hyperparameters (settings for fine-tuning a model). Scikitlearn will run through all of them to find the best one, we simply need to feed in the X variables (our feature engineering parameters above) and the Y variables (the correct labels) to each model, and perform the .fit() function and voila!
Evaluating performance
After running the grid search, we find our winning model to be the Linear SVM (0.974) and Logistic regression (0.968) coming at a close second. Even with such high accuracy, a machine learning model will make mistakes. If it doesn’t make any mistakes, then there is definitely something wrong with the code.
In order to understand where the model performs best and worst, we will use another useful machine learning tool, the confusion matrix.
When looking at a confusion matrix, focus on the diagonal squares. The counts there are correct predictions and the counts outside are failures. In the confusion matrix above we can quickly see that the model does really well-labeling products, but terribly labeling pages that are not product or categories. Intuitively, we can assume that such pages would not have consistent image usage.
Here is the code to put together the confusion matrix:
Finally, here is the code to plot the model evaluation:
Resources to learn more
You might be thinking that this is a lot of work to just tell page groups, and you are right!
Mirko Obkircher commented in my article for part two that there is a much simpler approach, which is to have your client set up a Google Analytics data layer with the page group type. Very smart recommendation, Mirko!
I am using this example for illustration purposes. What if the issue requires a deeper exploratory investigation? If you already started the analysis using Python, your creativity and knowledge are the only limits.
If you want to jump onto the machine learning bandwagon, here are some resources I recommend to learn more:
Attend a PyData event. I got motivated to learn data science after attending the one they host in New York.
Hands-On Introduction To Scikit-learn (sklearn)
Scikit Learn Cheat Sheet
Efficiently Searching Optimal Tuning Parameters
If you are starting from scratch and want to learn fast, I’ve heard good things about Data Camp.
Got any tips or queries? Share them in the comments.
Hamlet Batista is the CEO and founder of RankSense, an agile SEO platform for online retailers and manufacturers. He can be found on Twitter @hamletbatista.
The post Using Python to recover SEO site traffic (Part three) appeared first on Search Engine Watch.
from Digital Marketing News https://searchenginewatch.com/2019/04/17/using-python-to-recover-seo-site-traffic-part-three/
0 notes
click2watch · 6 years
Text
This Scaling Tech Could Let You Sync Bitcoin Straight From Your Phone
“Maybe we don’t have to store everything ourselves.”
That’s Tadge Dryja, cryptocurrency research scientist at the MIT Digital Currency Initiative, explaining the concept behind his bitcoin scaling solution, “utreexo.”
Based on an idea that has been pursued by developers for many years, utreexo seeks to streamline an aspect of bitcoin’s code that leads to heavy storage requirements over time.
Simply put, it addresses what is known as the UTXO set – the data that records which bitcoins have not yet been spent.
Currently, bitcoin nodes must download the entirety of this information – what is known as the “state” – in order to verify it.
With utreexo, though, rather than having to download the entirety of the bitcoin state, bitcoin holders could simply verify that it is correct using a cryptographic proof. This approach could minimize storage requirements to the extent that it might even be possible to run bitcoin on a mobile phone.
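None of the actual utreexo code appears in this article, and the sketch below is emphatically not Dryja’s design (the real accumulator is a dynamic forest of Merkle trees, not a single static tree); it is only a toy hash-based commitment in Python to illustrate the general idea that a verifier can check membership against a small digest instead of storing the full UTXO set:

import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    """Fold a list of leaf hashes up to a single 32-byte root."""
    level = leaves[:]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last hash if the level is odd
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def merkle_proof(leaves, index):
    """Collect the sibling hashes needed to re-derive the root for one leaf."""
    proof, level = [], leaves[:]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))  # (hash, sibling-is-on-the-left)
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return proof

def verify(root, leaf, proof):
    node = leaf
    for sibling, sibling_is_left in proof:
        node = h(sibling + node) if sibling_is_left else h(node + sibling)
    return node == root

# Toy "UTXO set": the verifier keeps only the 32-byte root, not the full set.
utxos = [h(f"txid:{i}|vout:0|50000 sats".encode()) for i in range(8)]
root = merkle_root(utxos)
proof = merkle_proof(utxos, 3)        # supplied by whoever stores the full data
print(verify(root, utxos[3], proof))  # True: membership checked without the full set

The asymmetry is the point of the toy: the verifier keeps a constant-size digest, and each spend carries a small proof instead of requiring the whole state.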
Also known as an accumulator, the tech underpinning utreexo isn’t a new idea – developers have been discussing ways to implement similar kinds of code since bitcoin’s early days – but it was previously met with hurdles to implementation.
Now, due to work by Dryja and others, it is swiftly becoming a reality. Dryja has already created functioning proof-of-concept code in an early prototype.
And he’s not alone. Dryja is joined by cryptography heavyweights Dan Boneh, Benedikt Bünz and Ben Fisch, who have written a paper detailing an alternate accumulator method.
“The high-level goal is basically your phone could run a full node. That is the dream,” Bünz, who is known for his work on bulletproofs, a scaling tech that allowed monero to reduce transaction fees by 96 percent, told CoinDesk.
Bünz’s paper has even been picked up by ethereum researchers, who are investigating how the technology might apply to layer two scaling solution, Plasma.
And part of this flurry of activity stems from the fact that due to the nature of the technology, it doesn’t require a hard fork – a type of software update that requires unanimous support and participation – in order to safely activate. Instead, accumulators would be deployed at the wallet level, which significantly reduces the hurdle to implementation.
“Hard forks are almost impossible on bitcoin. Soft forks are hard as well,” Bünz said, adding:
“It’s great that we can just deploy it, it makes it a lot easier and it means we can have a competition of ideas.”
Growing bigger
Stepping back, accumulators have been discussed since as early as 2010; however, they were previously met with an insurmountable bottleneck – what is known as a bridge node.
And that’s because, in order to function, accumulators require other people within the network to support the software. Previously, this was highly resource-intensive, but Dryja has built a bridge node that doesn’t come with additional trade-offs – meaning that accumulators are now feasible for the first time.
According to Dryja, that’s notable because utreexo could address what has been a long-term pressure point for bitcoin: its increasing UTXO set.
UTXO – which stands for unspent transaction output – is the data structure that gives information about all the outstanding bitcoins on the network.
While it is known to fluctuate (the UTXO count actually decreased in 2018), the dataset tends to increase alongside bitcoin’s usage. This means that, if left unchecked, it could continue to grow, necessitating ever-increasing storage requirements.
In particular, this is something that concerns what is known as a bitcoin “full node,” a type of node that keeps a history of every transaction ever made on bitcoin. Currently, a full node requires about 200 gigabytes of storage – just beyond what a conventional laptop can store.
With accumulators, though, full nodes no longer need to store all of the blockchain data in order to reach consensus about where coins are on the network. Instead, they can simply provide proofs that the data is correct.
“The high level is this idea of separating the consensus away from the state,” Bunz summarized, “Anyone can now be a full node without having to store the data.”
Previously, the mobile use case was addressed by a particular type of client called an SPV client, which requires light wallets to trust other full nodes to have the correct data. Because this comes with weaker security guarantees, accumulators are heralded as a way to achieve the same goal without the trade-offs.
“My hope is that the people who are currently running SPV wallets would be able to use [utreexo] and get the same security of a full node, with the resource requirements that are more similar to SPV,” Dryja summarized.
The competition
But while they are both positioned toward the same goal, there are ways in which Dryja’s utreexo model and the work by Bunz differ significantly as well.
First and foremost, Dryja’s work stands out because it is much closer to deployment. For example, it already has a working prototype and functioning code. Equally, it uses simple mathematics – hash functions that are already familiar to bitcoin.
Bunz’s design, on the other hand, is potentially more efficient and boasts more advanced features. Still, it uses mathematics that, according to Dryja, is riskier and more exotic than his own design.
For example, one stage of Bunz’s accumulators requires a kind of trusted setup – in short, the product of two secret numbers, that if revealed could be compromising to its security.
“We’re using fancier maths to get different properties,” Bunz said,
“The high level differences is [utrexxo] is ready now, it’s based on a simpler thing, it’s based on simple hash function, which is a good thing, but ours has more advanced cool features like batching and aggregating which would be cool at some point.”
Additionally, Bunz’s paper has a section that may have implications for the world’s second largest blockchain, ethereum, as well.
Speaking to CoinDesk, Georgios Konstantopoulos – a researcher and developer for ethereum layer two scaling solution, Plasma – said that due to its applicability, Bunz’s paper had attracted a lot of enthusiasm in the ethereum research community.
For example, Konstantopoulos said that Bunz’s accumulators could even be a more efficient replacement for the most fundamental data structure in ethereum, the Merkle-tree. Additionally, accumulators could help solve a problem inherent to Plasma Cash, which requires users to store large transaction histories.
The enthusiasm was such that Konstantopoulos estimated 10 new designs for how Bunz’s accumulators could apply to ethereum have been proposed, prompting the researcher to undertake a “taxonomy” to analyze the viability of each idea.
He told CoinDesk:
“I’m generally very optimistic that we will find a UXTO compaction scheme for Plasma.”
A ways to go
Still, there’s work that remains on all fronts before the scaling solutions can be considered viable.
Konstantopoulos emphasized that while accumulators could theoretically be useful for ethereum on both layer one and layer two scaling solutions, work remains in order to fully investigate its practical viability.
And both Bunz and Dryja emphasized similar caution as well.
For example, while accumulators have the potential to allow full nodes on mobile phones in terms of storage, they will encounter other hurdles to implementation.
In Dryja’s model, he emphasized that in its current implementation the accumulator is only really useful for bottom-of-the-range computers.
“If you have a fast computer this actually doesn’t help. It will not make much difference or make it slower. But if you have a crummy computer it will make a really big difference,” he continued,
“We want bitcoin to work on crummy computers as well.”
For Bunz’s paper, work remains in order to build a working implementation of the design, which may come with its own unanticipated research problems.
Plus, using the mobile phone as an example, Bunz said that while it would be technically feasible to deploy in terms of storage, the phone would need to be constantly online in order to function.
However, Bunz said that such problems can likely be overcome given sufficient research.
“This is one step of the way for getting us to a space where your mobile phone can run a full node,” Bunz said, “There’s nothing theoretically that stands in the way, we just need to be smart about how we do things.”
He continued:
“There needs to be a lot of new innovation happening, but thankfully there is, and it’s really possible.”
Phone image via Shutterstock
This news post is collected from CoinDesk
The post This Scaling Tech Could Let You Sync Bitcoin Straight From Your Phone appeared first on Click 2 Watch.
More Details Here → https://click2.watch/this-scaling-tech-could-let-you-sync-bitcoin-straight-from-your-phone
0 notes
lucyariablog · 6 years
Text
Are You Really Smart About How AI Works in Marketing?
In its widely talked about State of Marketing Report, Salesforce reports that just over half (51%) of marketers are using AI in one form or another, while another quarter plan to test it over the next two years.
A smaller study of over 500 search, content, and digital marketers by BrightEdge found that just 4% have implemented AI (that’s not a typo).
Who’s right? Salesforce, which reports one in two marketers is using AI, or BrightEdge, which puts the number at one in 25?
The answer may be “neither.” That’s because many marketers (and business leaders as a whole) are confused about which technologies are genuinely AI-powered and which simply rely on advanced algorithms and analytics.
As Luis Perez-Breva, head of MIT’s Innovation Teams Program and research scientist at MIT School of Engineering, explains, “Most of what the retail industry refers to as artificial intelligence isn’t AI.” He says many “confuse analyzing large amounts of data and profiling customers for artificial intelligence. Throwing data at machines doesn’t make machines (or anyone) smarter.”
Rather, AI’s promise is what is often called relevance at scale. It’s the ability of machines to crunch massive datasets and data lakes – structured and unstructured data – and optimize decision-making in a way that algorithm-enabled humans cannot achieve. Perhaps most importantly, in an AI-enabled system the machine learns and improves without human input.
Rather than ask, “How many marketers are using AI?,” the more apt question may be, “What are you doing with it?” Let’s examine some of the ways companies are using AI-led initiatives to make the most of AI’s promise.
HANDPICKED RELATED CONTENT:
Should You Trust Artificial Intelligence to Drive Your Content Marketing?
8 Ways Intelligent Marketers Use Artificial Intelligence
Using AI for personalization
Marketers have long practiced personalization in content marketing, developing over time more sophisticated ways of personalizing the customer journey – whether through marketing automation and progressive profiling or using programmatic advertising to support our content path. The idea is that as we learn more about our customer or prospect and fill in information about that person’s needs, budgets, and interests, we can create unique, personalized experiences that educate and delight the person.
Now we are entering the era of hyper-personalization: the ability to personalize not just by persona, profile, or the trail of breadcrumbs people leave on your site, but by a massive set of user details and signals, analyzed and made actionable by machines.
The retail industry is the most talked about application of AI-led personalization, but most examples you read about don’t really fit the definition of AI … they’re just really good personalization.
The examples that seem to cross over – from algorithm-driven personalization to AI-driven personalization – are those in which the AI sifts through data from multiple channels and sources, learning which signals matter in which circumstances and evolving its approach over time. The key variables that influence how one customer interacts with your brand may be completely different from the variables that define another, multiplied millions of times across each person, each channel, and each step of the process – and changing constantly.
HANDPICKED RELATED CONTENT: Cognitive Content Marketing: The Path to a More (Artificially) Intelligent Future
Using AI for voice-searchable entertainment and education
A less common but exciting application for AI-enriched content? Virtual assistants. Alexa (Amazon) offers developers the chance to build “skills” on its platform. Alexa Skills help customers answer questions, gather information, and even control internet-enabled devices and appliances. (To be fair, there’s disagreement about whether Alexa is an AI technology or just an advanced natural language technology – another nod to the problem of assessing AI adoption.)
Companies far and wide are racing to launch Alexa Skills – both to inform and delight customers as well as to test out the channel’s promise.
Entertainment
Content-rich brands are delivering entertainment and information via Alexa Skills. Disney’s Character of the Day Skill introduces a new character each day from Disney, Pixar, Marvel, and Star Wars. Or you could try out Cat Translator to understand the “why” behind weird cat behavior.
Real-time news
Media companies have been among the first to offer content snippets via Alexa Skills. If you enable the NPR News Hour Skill, for example, you’ll have access to a five-minute news summary, refreshed every hour. Big brands are quickly jumping in too. J.P. Morgan customers can access investment news: “Send me the latest research report from Joyce Chang” or “Send me the tear sheet for eBay.”
Customer service and engagement
Global consumer brands are enabling e-commerce, customer service, and analytics using Alexa Skills. The Capital One Skill lets you ask Alexa, “How much did I spend at Target last month?” or “When is my mortgage payment due?”
For content marketers, there are interesting opportunities to deliver education and entertainment via voice-enabled search. Beauty brand Wunder2 was the first in its segment to launch an Amazon Alexa Skill. The company offers a daily beauty tip via Skills, from how to thicken the appearance of your brows to how to achieve healthier looking hair. As one reviewer explained, “It’s very cool when I can get the latest beauty tips while having my hands free to apply my makeup.”
Wunder2 co-founder and CEO Michael Malinsky tells Forbes, “As a business, we are fascinated with the rapid integration of AI into people’s lives. We think the level of adoption will exceed many people’s expectation and create fluid recommendation experiences using AI technology found in Google Home, Alexa, and the recently launched Apple HomePod. It is something we are absolutely developing already.”
HANDPICKED RELATED CONTENT: How to Set Your Content Free for a Mobile, Voice, Ready-for-Anything Future
Using AI to put email on steroids
For marketers, AI-enabled decision-making for customizing and delivering email (i.e., dynamic emails) could be a game-changer.
Once upon a time, marketers would ask, “What’s the best time of day to send out our email newsletter?” Through trial and error, marketers discovered that certain days and times yielded higher open rates on average.
AI, however, allows marketers to send emails based on the open histories of individual users (or people like him/her in the absence of better data). And no longer will marketers send promotions to huge swaths of their audience. Instead, promotions will be designed uniquely for prospects based on a wide range of signals, from cart abandonment in retail to which times of day an individual is most likely to sign up for a conference. Finally, AI will enable much more customized and nuanced customer journeys. That leads to our next AI application – one which is too often misunderstood.
HANDPICKED RELATED CONTENT: Scale Your B2B Content With Artificial Intelligence: Ideas and Tools Marketers Can Try
Using AI to write
Long decried as evidence that AI will usher in a new soulless age, machine-made content is one of the most controversial applications of AI … but, under the right circumstances, it may be the most pro-creative. Let me explain.
As machine-made content becomes better at approximating human language, there’s a clear case for its use in content marketing. Not all content generated by marketing needs to be highly creative and witty, after all. Many organizations are already using machine-generated content, such as Edmunds generating vehicle profiles based on manufacturer data and Homesnap publishing community profiles based on publicly available data. The best applications are those in which there’s a need to publish at scale and the content is somewhat “modular” or easily put together from pieces and parts.
And, if you’re not convinced, perhaps this will change your tune. Even The Washington Post uses machine-generated content. According to Digiday, as of September 2017, the paper’s robot writer (a solution from Heliograph) had published 850 articles and tweets like this one:
Landon beat Whitman 34-0; https://t.co/V6zVPi7a9O @LandonSports @koachkuhn
— WashPost HS Sports (@WashPostHS) September 2, 2017
The key is in how you pair the robot to the writing. For The Washington Post, Heliograph generated articles about local political races, where the paper didn’t have the resources to assign reporters but had data to fill in the story. It also published short summaries about the Olympics in Rio via machine. (The paper reports that four employees previously took 25 hours to collect, analyze, and report on a small portion of local election results. Using Heliograph, The Washington Post created more than 500 articles generating 500,000 views.)
And therein lies the most powerful promise of AI: to release marketers from the mundane to focus on more creative and fulfilling efforts. Marvin Chow, vice president of global marketing at Google, writes that artificial intelligence and machine learning “will spark new ideas and push the boundaries of creativity. With new tools, what will makers, artists, and musicians design? And how will that affect the marketing world we work in?” The full vision is still out of reach, but early signs point to a machine-led period of creative efficiency.
HANDPICKED RELATED CONTENT:
Content Creation Robots Are Here [Examples]
Will Artificial Intelligence Replace Manual Content Creation?
A version of this article originally appeared in the August issue of  Chief Content Officer. Sign up to receive your free subscription to our print magazine every quarter.
Discover more about how to use AI (and how not to use it) at Content Marketing World Sept. 4-7 in Cleveland, Ohio. Register today and use code BLOG100 to save $100.
Cover image by Joseph Kalinowski/Content Marketing Institute
The post Are You Really Smart About How AI Works in Marketing? appeared first on Content Marketing Institute.
from https://contentmarketinginstitute.com/2018/08/ai-works-marketing/
0 notes