datascraping001 · 1 year ago
Google Search Results Data Scraping
Harness the Power of Information with Google Search Results Data Scraping Services by DataScrapingServices.com. In the digital age, information is king. For businesses, researchers, and marketing professionals, the ability to access and analyze data from Google search results can be a game-changer. However, manually sifting through search results to gather relevant data is not only time-consuming but also inefficient. DataScrapingServices.com offers cutting-edge Google Search Results Data Scraping services, enabling you to efficiently extract valuable information and transform it into actionable insights.
The vast amount of information available through Google search results can provide invaluable insights into market trends, competitor activities, customer behavior, and more. Whether you need data for SEO analysis, market research, or competitive intelligence, DataScrapingServices.com offers comprehensive data scraping services tailored to meet your specific needs. Our advanced scraping technology ensures you get accurate and up-to-date data, helping you stay ahead in your industry.
List of Data Fields
Our Google Search Results Data Scraping services can extract a wide range of data fields, ensuring you have all the information you need:
- Business Name: The name of the business or entity featured in the search result.
- URL: The web address of the search result.
- Website: The primary website of the business or entity.
- Phone Number: Contact phone number of the business.
- Email Address: Contact email address of the business.
- Physical Address: The street address, city, state, and ZIP code of the business.
- Business Hours: The business's operating hours.
- Ratings and Reviews: Customer ratings and reviews for the business.
- Google Maps Link: Link to the business's location on Google Maps.
- Social Media Profiles: Links to profiles such as LinkedIn, Twitter, and Facebook.
These data fields provide a comprehensive overview of the information available from Google search results, enabling businesses to gain valuable insights and make informed decisions.
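As a rough illustration of what records built from the fields above might look like, here is a minimal parsing sketch. It is not our actual pipeline: the input file name and every CSS selector are hypothetical placeholders, since real search-result markup varies and changes frequently.

```python
# Minimal sketch: turning a saved results page into rows with the fields
# above. The file name and all CSS selectors are hypothetical placeholders.
import csv
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def text_or_blank(card, selector):
    node = card.select_one(selector)
    return node.get_text(strip=True) if node else ""

with open("saved_results_page.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

rows = []
for card in soup.select("div.business-card"):  # hypothetical selector
    link = card.select_one("a[href]")
    rows.append({
        "business_name": text_or_blank(card, ".name"),
        "url": link["href"] if link else "",
        "phone": text_or_blank(card, ".phone"),
        "address": text_or_blank(card, ".address"),
        "rating": text_or_blank(card, ".rating"),
    })

with open("extracted_fields.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["business_name", "url", "phone", "address", "rating"])
    writer.writeheader()
    writer.writerows(rows)
```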
Benefits of Google Search Results Data Scraping
1. Enhanced SEO Strategy
Understanding how your website ranks for specific keywords and phrases is crucial for effective SEO. Our data scraping services provide detailed insights into your current rankings, allowing you to identify opportunities for optimization and stay ahead of your competitors.
2. Competitive Analysis
Track your competitors’ online presence and strategies by analyzing their rankings, backlinks, and domain authority. This information helps you understand their strengths and weaknesses, enabling you to adjust your strategies accordingly.
3. Market Research
Access to comprehensive search result data allows you to identify trends, preferences, and behavior patterns in your target market. This information is invaluable for product development, marketing campaigns, and business strategy planning.
4. Content Development
By analyzing top-performing content in search results, you can gain insights into what types of content resonate with your audience. This helps you create more effective and engaging content that drives traffic and conversions.
5. Efficiency and Accuracy
Our automated scraping services ensure you get accurate and up-to-date data quickly, saving you time and resources.
Best Google Data Scraping Services
Scraping Google Business Reviews
Extract Restaurant Data From Google Maps
Google My Business Data Scraping
Google Shopping Products Scraping
Google News Extraction Services
Scrape Data From Google Maps
Google News Headline Extraction   
Google Maps Data Scraping Services
Google Map Businesses Data Scraping
Google Business Reviews Extraction
Best Google Search Results Data Scraping Services in USA
Dallas, Portland, Los Angeles, Virginia Beach, Wichita, Nashville, Long Beach, Raleigh, Boston, Austin, San Antonio, Philadelphia, Indianapolis, Orlando, San Diego, Houston, Fort Worth, Jacksonville, New Orleans, Columbus, Kansas City, Sacramento, San Francisco, Omaha, Honolulu, Washington, Colorado Springs, Chicago, Arlington, Denver, El Paso, Miami, Louisville, Albuquerque, Tulsa, Bakersfield, Milwaukee, Memphis, Oklahoma City, Atlanta, Seattle, Las Vegas, San Jose, Tucson and New York.
Conclusion
In today’s data-driven world, having access to detailed and accurate information from Google search results can give your business a significant edge. DataScrapingServices.com offers professional Google Search Results Data Scraping services designed to meet your unique needs. Whether you’re looking to enhance your SEO strategy, conduct market research, or gain competitive intelligence, our services provide the comprehensive data you need to succeed. Contact us at [email protected] today to learn how our data scraping solutions can transform your business strategy and drive growth.
Website: Datascrapingservices.com
omindai · 21 days ago
How AI is Revolutionizing Social Media Marketing
Social media has become an indispensable arena for businesses seeking to connect with their target audience, build brand awareness, and drive sales. However, the sheer volume of data generated across these platforms can be overwhelming. Sifting through countless mentions, comments, and posts to extract meaningful insights is a monumental task. This is where Artificial Intelligence (AI) steps in, transforming the landscape of social media marketing with its powerful analytical capabilities.
AI is no longer a futuristic concept; it's a present-day reality deeply impacting how brands understand and engage with their customers online. By leveraging AI-powered tools, marketers can gain a deeper understanding of customer sentiment, identify emerging trends, and optimize their campaigns for maximum impact. This article explores the various ways AI is revolutionizing social media marketing and the benefits it offers to businesses of all sizes.
Unlocking Insights with AI-Driven Social Media Listening
One of the most significant applications of AI in social media marketing is AI-driven social media listening. Traditional social media monitoring often involves manual keyword searches and analysis, which can be time-consuming and prone to human error. AI, on the other hand, can automate the process, analyzing vast amounts of data in real-time to identify relevant conversations, brand mentions, and industry trends.
These advanced tools go beyond simple keyword matching. They utilize Natural Language Processing (NLP) and machine learning algorithms to understand the context and sentiment behind online conversations. This allows marketers to accurately gauge public perception of their brand, identify potential crises, and track the effectiveness of their marketing campaigns.
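For intuition, here is a toy example of automated sentiment scoring using NLTK's rule-based VADER analyzer. It is only a sketch of the idea; commercial listening platforms layer far more sophisticated, context-aware models on top of signals like this.

```python
# Toy sentiment scoring of brand mentions with NLTK's rule-based VADER
# analyzer. Commercial listening tools use richer, context-aware models.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # one-time lexicon download
sia = SentimentIntensityAnalyzer()

mentions = [
    "Absolutely love the new release, and support replied in minutes!",
    "Three days and still no answer from customer service...",
]
for post in mentions:
    score = sia.polarity_scores(post)["compound"]  # -1 (negative) to +1 (positive)
    label = "positive" if score > 0.05 else "negative" if score < -0.05 else "neutral"
    print(f"{label:>8} {score:+.2f} {post}")
```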
Beyond Monitoring: The Power of Social Listening for Brand Strategy
Social listening for brand strategy is no longer just about reacting to negative mentions. AI-powered tools transform social data into actionable insights that can inform strategic decision-making. By analyzing customer conversations, brands can identify unmet needs, discover new product opportunities, and refine their messaging to resonate more effectively with their target audience.
For instance, AI can identify trending topics related to a particular industry or product category. This allows brands to create relevant content and participate in conversations that are already capturing the attention of their target audience. Furthermore, AI can analyze competitor activity, providing valuable insights into their strategies and identifying opportunities for differentiation.
Optimizing Campaigns with AI-Powered Analytics
AI is not only valuable for gathering insights but also for optimizing social media campaigns in real-time. AI-powered analytics tools can track the performance of different ad creatives, targeting options, and posting schedules, identifying what works best and making automated adjustments to improve results.
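One simple mechanism behind this kind of automated adjustment is a bandit-style allocation: show each creative enough to estimate its performance, then shift impressions toward the winner. The sketch below is illustrative only, with simulated click rates; it is not any ad platform's actual algorithm.

```python
# Illustrative epsilon-greedy "bandit" allocation across ad creatives: spend
# most impressions on the best performer so far, keep exploring the rest.
# Not any platform's actual algorithm; click probabilities are simulated.
import random

creatives = {"video_a": 0.040, "carousel_b": 0.055, "static_c": 0.030}  # true CTRs (unknown in practice)
clicks = {c: 0 for c in creatives}
shows = {c: 0 for c in creatives}

random.seed(1)
for _ in range(10_000):
    if random.random() < 0.1:  # explore 10% of the time
        choice = random.choice(list(creatives))
    else:                      # otherwise exploit the best observed CTR
        choice = max(creatives, key=lambda c: clicks[c] / shows[c] if shows[c] else 0.0)
    shows[choice] += 1
    clicks[choice] += random.random() < creatives[choice]  # simulated click

for c in creatives:
    print(c, shows[c], f"observed CTR {clicks[c] / shows[c]:.3f}")
```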
This data-driven approach allows marketers to maximize their return on investment and ensure that their campaigns are reaching the right audience with the right message. AI can also personalize the user experience by delivering tailored content and offers based on individual preferences and behaviors.
Choosing the Right Social Media Monitoring Solutions
With a growing number of social media monitoring solutions powered by AI, selecting the right platform for your business is crucial. When evaluating different options, consider the following factors:
Accuracy: The platform should accurately identify relevant conversations and analyze sentiment with a high degree of precision.
Scalability: The solution should be able to handle large volumes of data and scale as your business grows.
Integration: The platform should integrate seamlessly with your existing marketing tools and workflows.
Customization: The tool should offer customization options to tailor the analysis to your specific business needs.
Reporting: The platform should provide clear and actionable reports that highlight key insights and trends.
The Future of AI in Social Media Marketing
As AI technology continues to evolve, its role in social media marketing will only become more prominent. We can expect to see further advancements in areas such as:
Predictive Analytics: AI will be able to anticipate future trends and predict customer behavior with greater accuracy.
Chatbot Integration: AI-powered chatbots will become more sophisticated, providing personalized customer service and driving sales directly through social media platforms.
Content Creation: AI may even play a role in generating social media content, assisting marketers with tasks such as writing captions and designing visuals.
In conclusion, AI is transforming social media marketing from a largely manual and reactive process to a data-driven and proactive approach. By leveraging the power of AI, businesses can gain a deeper understanding of their customers, optimize their campaigns, and ultimately achieve their marketing goals. Embracing AI is no longer a luxury but a necessity for any business seeking to thrive in the ever-evolving social media landscape.
ukwebgeekz · 5 years ago
Custom Extraction Methods & Tools
It should not need saying that a website is essential these days for any business to succeed. With literally billions of potential customers online, businesses simply can't afford to miss out. In 2018, global 'e-retail' sales totalled something in the region of $2.8 trillion. The projected figure for 2021 is nearer to $4.8 trillion.
Competition for online business is fierce, as you might expect. The companies with the best SEO strategies are clearly going to win. Those who don't invest in SEO, or who neglect their websites, will suffer - it's as simple as that.
You might have excellent SEO and a great website. But how do you give your business the edge over your competition? There are many tools and tricks out there to boost your chances and help you maximise opportunities, and custom extraction is one.
What is Custom Extraction?
First of all, let's think back to the average office about fifty years ago. Immediately, you'd note the absence of computers in most offices, and where they were present they weren't exactly 'user friendly' or powerful compared with today's models. When data was required, it was mostly left to humans to gather. It was often a painstaking, laborious, and thankless task, and then there was 'human error' to contend with. Yes, reports were made, and frequently made well. But they took time and effort that often exceeded their usefulness, and as more data became available, the exercise became less and less worthwhile. The internet wasn't around at the time, so you couldn't dive in to extract the data you required.
Jump forward to today, and we have information at our fingertips 24 hours a day. In fact, we have vast, overwhelming amounts of information and data. In a sense, we face a similar problem to our counterparts of half a century ago: sifting through masses of data to find anything of value is not a good use of our time, and people still make mistakes. This is where custom extraction can be a useful tool, especially with Screaming Frog, which is our favourite. Special software scours websites and grabs the data you need, then saves it in a file for you to analyse and present in a range of formats.
And this can save you resources, time, and money.
How does it work?
If you're not familiar with the technical terms, it can all get a bit confusing. Basically, all websites are built using a code called HTML. Other codes are used in the design and presentation, but HTML is the basic framework that all internet browsers recognise. If you were to look at the actual code for a website, you might be amazed at just how much is there. If you were to search through it to find a specific bit of information, it could take a while.
Using a 'web scraper' you can find that information quickly. These scrapers are also sometimes referred to as 'bots', 'spiders', or 'spiderbots' and are the same type of tools used by search engines to scour the millions of pages that make up the internet in order to index relevant sites for any particular search.
In terms of custom extraction, the program can search a website, or even something like a Google SERP (search engine results page), and extract the data you need in seconds, even if the website has multiple URLs (possibly over a million!).
Web data extraction is growing in popularity as businesses are waking up to its importance. Custom data extraction takes this a step further by speeding up the process and automating it. You can download software like Screaming Frog to perform this for you, set the required parameters, and wait for the results.
And these results can seriously boost your company's performance, or replace a time-consuming method with a fast, efficient data export.
An example of custom extraction for stock checks.
A lot of large online stores sell not only their own products but also those of small or large suppliers that may not have high-tech logistics, or even a stock list at all. This can be a problem: customers on your site may buy a product that you then cannot fulfill.
So how can we fix this?
You need a little code knowledge, specifically HTML and CSS, but it really is quite simple. The exact method you use might differ slightly between websites due to the structure or validity of their code, or because the content sits behind a login or paywall.
You will need a tool like XPather to inspect the code on the page you want to extract from.
For example:
A TV site might have some items listed as "out of stock", "limited stock" or "due back in October", along with a product code/identifier on the same page. We can use XPather to give us the CSS path, text path or inner HTML path of those values.
It will look something like this:
//div[@id='product-information']/p[1]
//div[@id='product-details']/p[1]
These two lines point to the exact location, within the page's underlying code, of the values a visitor can see. We can enter them into the "custom extraction" tool within Screaming Frog and set the bot running. (You can define a crawl speed; crawling too fast is not ethical to the site you are crawling, so be careful, but that topic deserves a post of its own.)
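Outside of Screaming Frog, you can sanity-check the same two XPath expressions with a few lines of Python before setting up a full crawl. This is a minimal sketch; the URL is hypothetical and assumes the page is publicly fetchable.

```python
# Minimal sketch: testing the two XPath expressions above on one page
# before configuring the full Screaming Frog crawl. The URL is hypothetical.
import requests
from lxml import html  # pip install requests lxml

url = "https://example-tv-store.com/products/widget-tv-55"  # hypothetical
page = html.fromstring(requests.get(url, timeout=10).text)

stock_status = page.xpath("//div[@id='product-information']/p[1]/text()")
product_code = page.xpath("//div[@id='product-details']/p[1]/text()")

# xpath() returns a list of matches; take the first if the element exists.
code = product_code[0].strip() if product_code else "code not found"
status = stock_status[0].strip() if stock_status else "status not found"
print(code, "->", status)
```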
[Images: an example of the custom extraction being added in Screaming Frog]
Once Screaming Frog has finished crawling the site (you can check its progress in the custom extraction tab) you can export the data to a spreadsheet. In this case it would be one column of product codes and, in the next column, each item's stock status. We can then use this information to do a mass export/import on the eCommerce site to hide out-of-stock items or un-hide back-in-stock items. This was a saviour for one of our clients, who at the time were checking stock manually or noticing problems only after orders had been placed.
What are the benefits of Custom Extraction?
Here are a few examples of how this software - and the data it produces - can help:
Website / SEO audits
Your website can hide many technical errors that the human eye may not see, which is even more likely on large sites with 1,000s of pages. With these tools you can quickly gather all of the data you need, making it easier to spot technical SEO errors.
Keywords
Find the most suitable keywords for your business or product and see which keywords your competitors are using. Compile a chart to see which are ranking highest.
Research
Regularly keep an eye on your competitors through an automated system, giving you information on their finances, funding, pricing, new developments, technology, and so on. Also, the 'web scraping' software can access content that is usually only available through downloading or creating an account.
Monitoring news sites
Some businesses operate in markets that are affected by events around the globe. This software can automatically gather data that helps them take timely and appropriate action - possibly before their rivals.
Locating a target audience
Custom extraction tools can search social media to discover general trends and to see which products are in demand. They can also explore existing customers to determine satisfaction levels. Both methods help to direct future marketing campaigns, seize opportunities, and maintain customer loyalty.
Brand image and awareness
Image is everything. The World Wide Web gives virtually everyone the opportunity to express their opinion on anything, including your business. Through web data extraction you can find out exactly what people think. Then you can use this data to maintain that image, or take steps to improve it, tracking changes through new reports.
Track Stock Volumes
By performing searches on your own pages, you can keep track of stock levels. One of the most frustrating aspects of online shopping is the message 'out of stock'. This can help eCommerce companies massively if they distribute suppliers' products and do not have up-to-date stock lists from those suppliers. (We actually used this for a client and it worked a treat during the busy Covid period!)
Keep your website updated
In cases of 'time-sensitive' events, like Black Friday, you need to quickly locate instances where wording and details need to be adjusted. Data extraction software keeps track of these locations to ensure they are updated.
Pricing strategy
Instantly keep abreast of fluctuations in the prices of goods and services offered by your competitors. Analyze their products, rates, and features.
The benefits and advantages stretch way beyond this list, but it serves as an example of how useful custom extraction can be. Installing this software makes good sense for any business that is serious about SEO.
Based in the North East of England, UK Web GeekZ are innovators in the search engine optimisation industry and use tactics that work time and time again. Give us a call if you would like to get more traffic or simply a new website design.
UK WEB GEEKZ
5-7 Amity House, Coniscliffe Road, Darlington, County Durham, DL3 7EE. Tel: 07906392955
Post source: Custom Extraction Methods & Tools, via https://www.ukwebgeekz.com
hub-pub-bub · 6 years ago
Product and price. Once upon a time they were all you needed to worry about because that was how you attracted customers and beat the competition. Not any more. With just a few keystrokes, a potential customer can compare prices from distributors around the world and read all about the product’s features, benefits, problems, product ratings and more. Shopping has moved online, and if you want to compete, it’s all about the customer experience.
While it’s true that physical stores still account for most retail sales, it’s fair to say we are in the middle of an economic shift, where the rules of customer experience are being completely rewritten. The Office for National Statistics says online sales in the UK, as a proportion of all retailing, reached 18.9% by March 2019. Never before has the customer been more in focus, more accessible and had more influence over the success of a commercial organization. For a company to stay competitive, the focus must be on the customer experience.
The Evolution of the Customer Experience
If you are marketing online, the browser is your store’s front window, sales associate and cashier. You need to ask yourself: Is our website easy enough to navigate or are there too many steps between the search and the checkout? Does anything function too slowly? Problems with any one of these — or countless others — can affect sales, but the task of responding appropriately to each of these factors can be overwhelming.
But the potential for problems doesn’t end with the browser experience; in fact, it’s just getting started. Forward-thinking businesses understand that competing in today’s crowded market requires a seamless blend of shopping options — whether it's online, physical store, mobile, voice assistant or a customer service call.
The first challenge comes in figuring out how to connect all those touchpoints, giving customers a seamless experience with no reason to look elsewhere. But to do that, data from multiple shopping options and the delivery of timely content needs to be regularly analyzed and checked. This isn’t easy to do, even if you have the internal resources available.
Better Technology Can Provide Better Customer Experiences
Consider the value of customer data. When a business is able to collect and analyze relevant customer data, the company can improve its ability to deliver a seamless experience. The right technology allows businesses to sift through massive amounts of customer data and provide real-time assistance at scale. Digital experience platforms can optimize a customer’s entire onsite journey, ensuring that each customer interaction is relevant and personalized. Whether customers interact with your business via its website, app or voice, their experience depends heavily on the data connections happening behind the scenes. Failing to make use of these connections means customers can be asked the same questions over and over, offered irrelevant services or content and, ultimately, given a less than ideal user experience.
Knowing the importance of the digital landscape, companies are anxious to go digital. Gartner predicts that by 2021, 90% of global organizations will rely on system integrators (SIs), agencies and channel partners to design, build and implement their digital strategies.
Letting AI Take the Lead
By using data in ways that provide actual convenience and by connecting customer information across all paths of their journey, businesses can offer a more informed and tailored approach to each user. And AI, powered by a set of algorithms and advanced merchandising capabilities that allow brands to optimize each visitor’s search, browse and landing page experiences, can do just that.
Let’s take a look at how this could work in the real world. A retailer, let's call it "Reggie's," has a goal to become the nation’s leading specialist in widget manufacturing and maintenance. For Reggie’s to pull this off would require a new approach to its digital offerings.
Let’s assume Reggie’s already has a workable search function, that 20% of online sessions include searches, and that search drives 40% of the company’s online sales. You’d think any company would be happy with those numbers. But managing this by conventional methods would mean lots and lots of resources. This kind of operation takes an intensive amount of human labor to manage a company’s ever-changing product inventory, including all the manual rule writing necessary to ensure that each search will be optimized.
The problem isn’t in the operation, the problem is the resources — or lack thereof. After a certain point, it becomes impossible to keep up. This is where AI comes in. AI can make it possible to continuously improve and optimize search results with custom algorithms — without adding more and more resources. This way, Reggie’s would no longer have to manually write new rules, and it could eliminate the time-intensive tasks and the many hours of work that accompany them.
Now, let’s imagine going one step further. What if Reggie’s were to combine its new search function with data-driven insights to provide its customers with even more relevant results? This would not only make it possible to recommend the right search results, but it would also allow it to gather valuable information on its customers’ habits each time they visit the site.
Working Smarter, Not Harder
Improving customer experience has become an increasingly complex exercise. However, by using the right experience platform and AI tools, you too can gain more control over every aspect of the experience, from search and merchandising through to CMS. Implementing the right AI solution provides the means of driving visitors to purchase faster and eliminating your operation’s most time-consuming tasks. And this, of course, gives you more time for the other all-important factor in retail: the human touch.
About the Author
Raj De Datta is the CEO and co-founder of digital experience software provider Bloomreach. He brought 10 years of enterprise and entrepreneurial experience with him when he co-founded the company back in 2009.
terabitweb · 6 years ago
Original post from FireEye. Author: Philip Tully.
Reverse engineers, forensic investigators, and incident responders have an arsenal of tools at their disposal to dissect malicious software binaries. When performing malware analysis, they successively apply these tools in order to gradually gather clues about a binary’s function, design detection methods, and ascertain how to contain its damage. One of the most useful initial steps is to inspect its printable characters via the Strings program. A binary will often contain strings if it performs operations like printing an error message, connecting to a URL, creating a registry key, or copying a file to a specific location – each of which provide crucial hints that can help drive future analysis.
Manually filtering out these relevant strings can be time consuming and error prone, especially considering that:
Relevant strings occur disproportionately less often than irrelevant strings.
Larger binaries can output upwards of tens of thousands of individual strings.
The definition of “relevant” can vary significantly across individual human analysts.
Investigators would never want to miss an important clue that could have reduced their time spent performing the malware analysis, or even worse, led them to draw incomplete or incorrect conclusions. In this blog post, we will demonstrate how the FireEye Data Science (FDS) and FireEye Labs Reverse Engineering (FLARE) teams recently collaborated to streamline this analyst pain point using machine learning.
Highlights
Running the Strings program on a piece of malware inevitably produces noisy strings mixed in with important ones, which can only be uncovered after sifting and scrolling through the entirety of its messy output. FireEye’s new machine learning model that automatically ranks strings based on their relevance for malware analysis speeds up this process at scale.
Knowing which individual strings are relevant often requires highly experienced analysts. Quality, security-relevant labeled training data can be time consuming and expensive to obtain, but weak supervision that leverages the domain expertise of reverse engineers helps accelerate this bottleneck.
Our proposed learning-to-rank model can efficiently prioritize Strings outputs from individual malware samples. On a dataset of relevant strings from over 7 years of malware reports authored by FireEye reverse engineers, it also performs well based on criteria commonly used to evaluate recommendation and search engines.
Background
Each string returned by the Strings program is represented by sequences of 3 characters or more ending with a null terminator, independent of any surrounding context and file formatting. These loose criteria mean that Strings may identify sequences of characters as strings when they are not human-interpretable. For example, if consecutive bytes 0x31, 0x33, 0x33, 0x37, 0x00 appear within a binary, Strings will interpret this as “1337.” However, those ASCII characters may not actually represent that string per se; they could instead represent a memory address, CPU instructions, or even data utilized by the program. Strings leaves it up to the analyst to filter out such irrelevant strings that appear within its output. For instance, only a handful of the strings listed in Figure 1 that originate from an example malicious binary are relevant from a malware analyst’s point of view.
Figure 1: An example Strings output containing 44 strings for a toy sample with a SHA-256 value of eb84360ca4e33b8bb60df47ab5ce962501ef3420bc7aab90655fd507d2ffcedd.
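For intuition, the ASCII extraction rule described above can be approximated in a few lines. This is a simplified sketch, not the actual Strings implementation, which also handles Unicode strings and other termination cases.

```python
# Simplified sketch of the ASCII extraction rule described above: runs of
# 3 or more printable characters ending with a null terminator. The real
# Strings program also handles Unicode and other cases.
import re
import sys

PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{3,}\x00")

def extract_ascii_strings(path):
    with open(path, "rb") as f:
        data = f.read()
    # Drop the trailing null byte and decode each printable run.
    return [m.group()[:-1].decode("ascii") for m in PRINTABLE_RUN.finditer(data)]

if __name__ == "__main__":
    for s in extract_ascii_strings(sys.argv[1]):
        print(s)
```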
Ranking strings in terms of descending relevance would make an analyst’s life much easier. They would then only need to focus their attention on the most relevant strings located towards the top of the list, and simply disregard everything below. However, solving the task of automatically ranking strings is not trivial. The space of relevant strings is unstructured and vast, and devising finely tuned rules to robustly account for all the possible variations among them would be a tall order.
Learning to Rank Strings Output
This task can instead be formulated in a machine learning (ML) framework called learning to rank (LTR), which has been historically applied to problems like information retrieval, machine translation, web search, and collaborative filtering. One way to tackle LTR problems is by using Gradient Boosted Decision Trees (GBDTs). GBDTs successively learn individual decision trees that reduce the loss using a gradient descent procedure, and ultimately use a weighted sum of every tree’s prediction as an ensemble. GBDTs with an LTR objective function can learn class probabilities to compute each string’s expected relevance, which can then be used to rank a given Strings output. We provide a high-level overview of how this works in Figure 2.
In the initial train() step of Figure 2, over 25 thousand binaries are run through the Strings program to generate training data consisting of over 18 million total strings. Each training sample then corresponds to the concatenated list of ASCII and Unicode strings output by the Strings program on that input file. To train the model, these raw strings are transformed into numerical vectors containing natural language processing features like Shannon entropy and character co-occurrence frequencies, together with domain-specific signals like the presence of indicators of compromise (e.g. file paths, IP addresses, URLs, etc.), format strings, imports, and other relevant landmarks.
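A simplified sketch of this featurization step is shown below. The entropy formula follows the standard Shannon definition; the indicator regexes are illustrative stand-ins for the much richer production feature set.

```python
# Illustrative featurization of one string: Shannon entropy plus a few
# indicator-of-compromise flags. A stand-in for the production feature set.
import math
import re
from collections import Counter

IP_RE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")
URL_RE = re.compile(r"https?://", re.IGNORECASE)
PATH_RE = re.compile(r"[A-Za-z]:\\|/(?:usr|etc|tmp)/")

def shannon_entropy(s: str) -> float:
    n = len(s)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())

def featurize(s: str) -> dict:
    return {
        "length": len(s),
        "entropy": shannon_entropy(s),
        "has_ip": bool(IP_RE.search(s)),
        "has_url": bool(URL_RE.search(s)),
        "has_path": bool(PATH_RE.search(s)),
        "has_format_spec": "%" in s,  # crude printf-style format-string hint
    }

print(featurize("http://203.0.113.7/gate.php"))
```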
Figure 2: The ML-based LTR framework ranks strings based on their relevance for malware analysis. This figure illustrates different steps of the machine learning modeling process: the initial train() step is denoted by solid arrows and boxes, and the subsequent predict() and sort() steps are denoted by dotted arrows and boxes.
Each transformed string’s feature vector is associated with a non-negative integer label that represents their relevance for malware analysis. Labels range from 0 to 7, with higher numbers indicating increased relevance. To generate these labels, we leverage the subject matter knowledge of FLARE analysts to apply heuristics and impose high-level constraints on the resulting label distributions. While this weak supervision approach may generate noise and spurious errors compared to an ideal case where every string is manually labeled, it also provides an inexpensive and model-agnostic way to integrate domain expertise directly into our GBDT model.
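The train/predict/sort loop of Figure 2 can be sketched compactly with an open-source GBDT library's LambdaRank objective. Everything below is placeholder data, not FireEye's model or features; it only shows the shape of the approach.

```python
# Compact sketch of Figure 2's train()/predict()/sort() loop with a GBDT
# ranker. All data below is random placeholder, not FireEye's features.
import numpy as np
import lightgbm as lgb  # pip install lightgbm numpy

rng = np.random.default_rng(0)
X_train = rng.random((600, 6))     # 600 featurized strings, 6 features each
y_train = rng.integers(0, 8, 600)  # weak-supervision relevance labels 0..7
groups = [200, 250, 150]           # number of strings per training binary

ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=100)
ranker.fit(X_train, y_train, group=groups)

# predict() and sort() for one new Strings output (44 strings, as in Figure 1)
X_query = rng.random((44, 6))
scores = ranker.predict(X_query)           # expected relevance per string
ranked_indices = np.argsort(scores)[::-1]  # most relevant string first
print(ranked_indices[:10])
```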
Next during the predict() step of Figure 2, we use the trained GBDT model to predict ranks for the strings belonging to an input file that was not originally part of the training data, and in this example query we use the Strings output shown in Figure 1. The model predicts ranks for each string in the query as floating-point numbers that represent expected relevance scores, and in the final sort() step of Figure 2, strings are sorted in descending order by these scores. Figure 3 illustrates how this resulting prediction achieves the desired goal of ranking strings according to their relevance for malware analysis.
Figure 3: The resulting ranking on the strings depicted in both Figure 1 and in the truncated query of Figure 2. Contrast the relative ordering of the strings shown here to those otherwise identical lists.
The predicted and sorted string rankings in Figure 3 show network-based indicators on top of the list, followed by registry paths and entries. These reveal the potential C2 server and malicious behavior on the host. The subsequent output consisting of user-related information is more likely to be benign, but still worthy of investigation. Rounding out the list are common strings like Windows API functions and PE artifacts that tend to raise no red flags for the malware analyst.
Quantitative Evaluation
While it seems like the model qualitatively ranks the above strings as expected, we would like some quantitative way to assess the model’s performance more holistically. What evaluation criteria can we use to convince ourselves that the model generalizes beyond the coverage of our weak supervision sources, and to compare models that are trained with different parameters?
We turn to the recommender systems literature, which uses the Normalized Discounted Cumulative Gain (NDCG) score to evaluate ranking of items (i.e. individual strings) in a collection (i.e. a Strings output). NDCG sounds complicated, but let’s boil it down one letter at a time (a short computational sketch follows the list):
“G” is for gain, which corresponds to the magnitude of each string’s relevance.
“C” is for cumulative, which refers to the cumulative gain or summed total of every string’s relevance.
“D” is for discounted, which divides each string’s predicted relevance by a monotonically increasing function like the logarithm of its ranked position, reflecting the goal of having the most relevant strings ranked towards the top of our predictions.
“N” is for normalized, which means dividing DCG scores by ideal DCG scores calculated for a ground truth holdout dataset, which we obtain from FLARE-identified relevant strings contained within historical malware reports. Normalization makes it possible to compare scores across samples since the number of strings within different Strings outputs can vary widely.
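Putting the four letters together, here is a small illustrative computation of NDCG@k matching the definition above (not FireEye's evaluation code):

```python
# Illustrative NDCG@k matching the letter-by-letter definition above.
import math

def dcg(relevances, k):
    # "G" and "C": sum relevance gains; "D": discount by log2 of rank position.
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))

def ndcg(relevances_in_predicted_order, k=100):
    # "N": divide by the DCG of the ideal (perfectly sorted) ordering.
    ideal = sorted(relevances_in_predicted_order, reverse=True)
    ideal_dcg = dcg(ideal, k)
    return dcg(relevances_in_predicted_order, k) / ideal_dcg if ideal_dcg else 0.0

# Ground-truth labels (0-7) of six strings, in the order the model ranked them:
print(round(ndcg([7, 5, 0, 3, 0, 1]), 3))
```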
Figure 4: Kernel Density Estimate of NDCG@100 scores for Strings outputs from the holdout dataset. Scores are calculated for the original ordering after simply running the Strings program on each binary (gray) versus the predicted ordering from the trained GBDT model (red).
In practice, we take the first k strings indexed by their ranks within a single Strings output, where the k parameter is chosen based on how many strings a malware analyst will attend to or deem relevant on average. For our purposes we set k = 100 based on the approximate average number of relevant strings per Strings output. NDCG@k scores are bounded between 0 and 1, with scores closer to 1 indicating better prediction quality in which more relevant strings surface towards the top. This measurement allows us to evaluate the predictions from a given model versus those generated by other models and ranked with different algorithms.
To quantitatively assess model performance, we run the strings from each sample that have ground truth FLARE reports through the predict() step of Figure 2, and compare their predicted ranks with a baseline of the original ranking of strings output by Strings. The divergence in distributions of NDCG@100 scores between these two approaches demonstrates that the trained GBDT model learns a useful structure that generalizes well to the independent holdout set (Figure 4).
Conclusion
In this blog post, we introduced an ML model that learns to rank strings based on their relevance for malware analysis. Our results illustrate that it can rank Strings output based both on qualitative inspection (Figure 3) and quantitative evaluation of NDCG@k (Figure 4). Since Strings is so commonly applied during malware analysis at FireEye and elsewhere, this model could significantly reduce the overall time required to investigate suspected malicious binaries at scale. We plan on continuing to improve its NDCG@k scores by training it with more high fidelity labeled data, incorporating more sophisticated modeling and featurization techniques, and soliciting further analyst feedback from field testing.
It’s well known that malware authors go to great lengths to conceal useful strings from analysts, and a potential blind spot to consider for this model is that the utility of Strings itself can be thwarted by obfuscation. However, open source tools like the FireEye Labs Obfuscated Strings Solver (FLOSS) can be used as an in-line replacement for Strings. FLOSS automatically extracts printable strings just as Strings does, but additionally reveals obfuscated strings that have been encoded, packed, or manually constructed on the stack. The model can be readily trained on FLOSS outputs to rank even obfuscated strings. Furthermore, since it can be applied to arbitrary lists of strings, the model could also be used to rank strings extracted from live memory dumps and sandbox runs.
This work represents a collaboration between the FDS and FLARE teams, which together build predictive models to help find evil and improve outcomes for FireEye’s customers and products. If you are interested in this mission, please consider joining the team by applying to one of our job openings.
Go to source: Learning to Rank Strings Output for Speedier Malware Analysis, by Philip Tully (FireEye).
martechadvisor-blog · 7 years ago
3 Ways to Use Your Data to Drive Personalized Customer Experiences
Leading companies have been investing in their customer experience with the hopes of improving their customer satisfaction, engagement, and net revenue. 90% of senior-level marketers say improving and personalizing their customer experience is among their top priorities. The problem, though, is that only 23% claim to have a customer-centric — versus a channel- or product-centric — organizational structure, according to Forrester. Without a customer-centric culture, your customer experience — and ultimately your company — will inevitably suffer. In the U.S. alone, the estimated cost of customers switching due to a poor experience is $1.6 trillion. So, how are marketing leaders expected to keep up with and even surpass unrelenting customer expectations while keeping profits up?
The short answer is deceptively simple: data. Data is at the center of customer-centricity but is still going vastly underutilized. Though CMOs recognize the huge opportunity big data brings, 61% admit they are not using their data to effectively drive their customer experience. Data is often perceived as more of an obstacle than anything else, and trying to sift through it manually will stop any company from scaling. Given that insight-driven organizations are growing 8x faster than the global GDP, I want to make sure you put your data to use sooner rather than later.
Recent developments in AI are not only automating the customer data-crunching process and alleviating associated intimidation, they’re fundamentally changing the way leading organizations are interacting with their customers. Though using data to inform decisions is nothing new, using data to drive the relevance of digital customer experiences and wow customers at every touchpoint is. How are they doing it? They’re taking a holistic approach. They’re looking at their entire marketing strategy through the lens of their customers and are aligning their marketing efforts to their customer journeys, and you can too. Here are 3 ways to use your data to drive personalized customer experiences.
1. Improve the relevance of your website
Relevant customer experiences drive conversions — sounds simple, doesn’t it? Just because it’s simple doesn’t mean it’s easy, but with AI, it’s actually never been easier.
With AI-powered search and recommendations, all your public-facing company information, from product details on your website, to customer reviews on third party sites, to videos from your knowledgebase can be unified into one single index and made available to your customers. This means, in just a click of a button, your customers will have the information they’re searching for, the moment they need it, plus other salient content that they will most likely need next.
However, standard search results are no longer cutting it for more tech-savvy customers. They expect Amazon-like experiences, which take into account their context and infer their intent. In short: they expect relevant interactions. Your customer satisfaction and customer experience go hand-in-glove and to keep both high, you need to treat each of your visitors as the person they are, as opposed to a number. This is how AI and machine learning are changing the game.
Through usage analytics data and machine learning, AI-powered search enables you to automatically track your visitors' journey and better understand what content they’re looking for. Delivering contextual and immediate information to each visitor not only enhances their experience with your brand, it contributes a great deal to their overall satisfaction and leads to more time spent on site as well as higher conversions, a true win-win.
2. Allow your customers to self-serve
The preference to self-serve continues to rise. Dimension Data has found that 73% of customers prefer to use a company’s website to find what they need. What’s more, Gartner predicts that by 2020, customers will manage 85% of their relationships with an enterprise without interacting with a human. Needless to say, self-service isn’t going anywhere and the sooner you make it part of your customer experience strategy, the better.
When it comes to a relevant self-service, speed is not the ultimate goal. It’s not just about delivering the best information within seconds, it’s about building a connection with each customer and evoking a positive emotional response. In fact, McKinsey found that 70% of buying experiences are based on how the customer feels they are being treated.
Your customers want to feel like you know them and appreciate their business, and they want you to prove just how well you know them by offering suggestions for other products and services they might like. If they have a question, they want their time respected and to feel empowered to answer it on their own without calling into your support center. It sounds like a tall order, but again, your data holds the key to their satisfaction, and AI-powered search and recommendations help you unlock your organization’s full potential.
Machine Learning gathers insight from your data and learns from your other customers’ behavior to predict what information would be most helpful to suggest next. These proactive recommendations are essential in driving engagement and influencing both cross-sell and upsell opportunities.
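A toy version of this "what helped customers like you" logic can be built from simple item co-occurrence counts over past sessions; real insight engines draw on far richer behavioral signals.

```python
# Toy "customers like you also found this helpful" recommender built from
# item co-occurrence over past sessions. Illustrative only; real insight
# engines use much richer behavioral signals.
from collections import Counter
from itertools import combinations

sessions = [
    ["reset-password", "2fa-setup"],
    ["reset-password", "2fa-setup", "account-recovery"],
    ["billing-faq", "refund-policy"],
    ["reset-password", "account-recovery"],
]

co_counts = Counter()
for docs in sessions:
    for a, b in combinations(sorted(set(docs)), 2):
        co_counts[(a, b)] += 1
        co_counts[(b, a)] += 1

def suggest(doc, k=2):
    candidates = sorted(((n, other) for (d, other), n in co_counts.items() if d == doc),
                        reverse=True)
    return [other for n, other in candidates[:k]]

print(suggest("reset-password"))  # next-best articles to surface
```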
When your customers have a question, they will almost always turn to your site for answers first. A unified index, coupled with machine learning algorithms based on data indicating what has helped others like them in the past, will ensure their self-service success. Effective self-service will also lower your customer effort score, increase your net promoter score and improve customer satisfaction overall.
Don’t be discouraged if your data shows that they didn’t find what they were looking for. This is another critical data point that can help you to improve their experience. With a better understanding of what content gaps exist, you can offer valuable guidance on how to fill them. This might mean new content for your blog, a new knowledgebase article or perhaps a new product feature. No matter how you cut it, data is your friend and should be used to continually improve your customer experience strategy.
3. Think beyond marketing
Your marketing team is not the only one who will benefit from customer data, nor is it their responsibility alone to carry the weight of the customer experience. Every customer-centric organization knows everyone has a role to play in their customer journey, which is why it’s so important to share knowledge and data enterprise-wide. This will ensure that no matter what part of the journey your customer is in — be it discovering your brand on your website, engaging with sales post-purchase, or on the phone with one of your support agents — they’re receiving a relevant and personalized experience.
Using your customer data to have a holistic view of their journey, from the channel they prefer, to the content they’ve engaged with, to the products or services they’ve purchased, to the tickets they’ve logged in the past, means that you can make informed and data-driven decisions about how best to improve their individual experience.
Equipping your support team with known information about your customer will make the support process more enjoyable, timely, and cost-effective. Enabling your sales team with your customer data can help them close deals faster, and providing consumer insight trends and behaviors to your product team can help inform your product roadmap for next year. Everything you know about your customer can be used to improve your customer experience strategy and ensure the satisfaction of every customer who engages with your brand at every touchpoint.
For most marketers, customer centricity is no more than an aspirational goal but it’s time to make it a reality. Knowing your customers is more important than ever before. Leveraging your data and refocusing your strategy based on customer insights about how they behave, not how you think they behave, is critical in delivering an experience that wows them. Today’s customer has more power and choice than ever before, it’s your job to make sure they choose you.
If your current tech stack doesn’t allow you to gain customer insight and make it easy for you to act on, you will continue to fall short of your customer's expectations. With AI-powered search and recommendations, you can bridge the gap. For more information about how an insight engine can help improve your customer experience and drive customer satisfaction, feel free to reach out to me on Twitter, I’d be happy to help.
This article first appeared on MarTech Advisor.
terabitweb · 6 years ago
Text
Original Post from FireEye Author: Philip Tully
Reverse engineers, forensic investigators, and incident responders have an arsenal of tools at their disposal to dissect malicious software binaries. When performing malware analysis, they successively apply these tools in order to gradually gather clues about a binary’s function, design detection methods, and ascertain how to contain its damage. One of the most useful initial steps is to inspect its printable characters via the Strings program. A binary will often contain strings if it performs operations like printing an error message, connecting to a URL, creating a registry key, or copying a file to a specific location – each of which provide crucial hints that can help drive future analysis.
Manually filtering out these relevant strings can be time consuming and error prone, especially considering that:
Relevant strings occur disproportionately less often than irrelevant strings.
Larger binaries can output upwards of tens of thousands of individual strings.
The definition of “relevant” can vary significantly across individual human analysts.
Investigators would never want to miss an important clue that could have reduced their time spent performing the malware analysis, or even worse, led them to draw incomplete or incorrect conclusions. In this blog post, we will demonstrate how the FireEye Data Science (FDS) and FireEye Labs Reverse Engineering (FLARE) teams recently collaborated to streamline this analyst pain point using machine learning.
Highlights
Running the Strings program on a piece of malware inevitably produces noisy strings mixed in with important ones, which can only be uncovered after sifting and scrolling through the entirety of its messy output. FireEye’s new machine learning model that automatically ranks strings based on their relevance for malware analysis speeds up this process at scale.
Knowing which individual strings are relevant often requires highly experienced analysts. Quality, security-relevant labeled training data can be time consuming and expensive to obtain, but weak supervision that leverages the domain expertise of reverse engineers helps accelerate this bottleneck.
Our proposed learning-to-rank model can efficiently prioritize Strings outputs from individual malware samples. On a dataset of relevant strings from over 7 years of malware reports authored by FireEye reverse engineers, it also performs well based on criteria commonly used to evaluate recommendation and search engines.
Background
Each string returned by the Strings program is represented by sequences of 3 characters or more ending with a null terminator, independent of any surrounding context and file formatting. These loose criteria mean that Strings may identify sequences of characters as strings when they are not human-interpretable. For example, if consecutive bytes 0x31, 0x33, 0x33, 0x37, 0x00 appear within a binary, Strings will interpret this as “1337.” However, those ASCII characters may not actually represent that string per se; they could instead represent a memory address, CPU instructions, or even data utilized by the program. Strings leaves it up to the analyst to filter out such irrelevant strings that appear within its output. For instance, only a handful of the strings listed in Figure 1 that originate from an example malicious binary are relevant from a malware analyst’s point of view.
Figure 1: An example Strings output containing 44 strings for a toy sample with a SHA-256 value of eb84360ca4e33b8bb60df47ab5ce962501ef3420bc7aab90655fd507d2ffcedd.
Ranking strings in terms of descending relevance would make an analyst’s life much easier. They would then only need to focus their attention on the most relevant strings located towards the top of the list, and simply disregard everything below. However, solving the task of automatically ranking strings is not trivial. The space of relevant strings is unstructured and vast, and devising finely tuned rules to robustly account for all the possible variations among them would be a tall order.
Learning to Rank Strings Output
This task can instead be formulated in a machine learning (ML) framework called learning to rank (LTR), which has been historically applied to problems like information retrieval, machine translation, web search, and collaborative filtering. One way to tackle LTR problems is by using Gradient Boosted Decision Trees (GBDTs). GBDTs successively learn individual decision trees that reduce the loss using a gradient descent procedure, and ultimately use a weighted sum of every trees’ prediction as an ensemble. GBDTs with an LTR objective function can learn class probabilities to compute each string’s expected relevance, which can then be used to rank a given Strings output. We provide a high-level overview of how this works in Figure 2.
In the initial train() step of Figure 2, over 25 thousand binaries are run through the Strings program to generate training data consisting of over 18 million total strings. Each training sample then corresponds to the concatenated list of ASCII and Unicode strings output by the Strings program on that input file. To train the model, these raw strings are transformed into numerical vectors containing natural language processing features like Shannon entropy and character co-occurrence frequencies, together with domain-specific signals like the presence of indicators of compromise (e.g. file paths, IP addresses, URLs, etc.), format strings, imports, and other relevant landmarks.
Figure 2: The ML-based LTR framework ranks strings based on their relevance for malware analysis. This figure illustrates different steps of the machine learning modeling process: the initial train() step is denoted by solid arrows and boxes, and the subsequent predict() and sort() steps are denoted by dotted arrows and boxes.
Each transformed string’s feature vector is associated with a non-negative integer label that represents their relevance for malware analysis. Labels range from 0 to 7, with higher numbers indicating increased relevance. To generate these labels, we leverage the subject matter knowledge of FLARE analysts to apply heuristics and impose high-level constraints on the resulting label distributions. While this weak supervision approach may generate noise and spurious errors compared to an ideal case where every string is manually labeled, it also provides an inexpensive and model-agnostic way to integrate domain expertise directly into our GBDT model.
Next during the predict() step of Figure 2, we use the trained GBDT model to predict ranks for the strings belonging to an input file that was not originally part of the training data, and in this example query we use the Strings output shown in Figure 1. The model predicts ranks for each string in the query as floating-point numbers that represent expected relevance scores, and in the final sort() step of Figure 2, strings are sorted in descending order by these scores. Figure 3 illustrates how this resulting prediction achieves the desired goal of ranking strings according to their relevance for malware analysis.
Figure 3: The resulting ranking on the strings depicted in both Figure 1 and in the truncated query of Figure 2. Contrast the relative ordering of the strings shown here to those otherwise identical lists.
The predicted and sorted string rankings in Figure 3 show network-based indicators on top of the list, followed by registry paths and entries. These reveal the potential C2 server and malicious behavior on the host. The subsequent output consisting of user-related information is more likely to be benign, but still worthy of investigation. Rounding out the list are common strings like Windows API functions and PE artifacts that tend to raise no red flags for the malware analyst.
Quantitative Evaluation
While the model qualitatively ranks the above strings as expected, we would like a quantitative way to assess its performance more holistically. What evaluation criteria can we use to convince ourselves that the model generalizes beyond the coverage of our weak supervision sources, and to compare models that are trained with different parameters?
We turn to the recommender systems literature, which uses the Normalized Discounted Cumulative Gain (NDCG) score to evaluate the ranking of items (i.e. individual strings) in a collection (i.e. a Strings output). NDCG sounds complicated, but let’s boil it down one letter at a time; a short computational sketch follows the list:
“G” is for gain, which corresponds to the magnitude of each string’s relevance.
“C” is for cumulative, which refers to the cumulative gain or summed total of every string’s relevance.
“D” is for discounted, which divides each string’s predicted relevance by a monotonically increasing function like the logarithm of its ranked position, reflecting the goal of having the most relevant strings ranked towards the top of our predictions.
“N” is for normalized, which means dividing DCG scores by ideal DCG scores calculated for a ground truth holdout dataset, which we obtain from FLARE-identified relevant strings contained within historical malware reports. Normalization makes it possible to compare scores across samples since the number of strings within different Strings outputs can vary widely.
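Putting those four letters together, NDCG@k can be computed in a few lines. The sketch below uses linear gain (the raw label) and a log2 position discount; some formulations use exponential gain (2^relevance − 1) instead, and the post does not specify which variant is used in production.

```python
# NDCG@k from first principles: gain = relevance label, discount = log2 of
# (1-indexed rank + 1), normalized by the ideal descending ordering's DCG.
import math

def dcg_at_k(relevances, k):
    return sum(rel / math.log2(i + 2)          # i = 0 gets discount log2(2) = 1
               for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Relevance labels of strings in the order a model ranked them (invented):
predicted_order = [7, 5, 6, 0, 1, 0, 3]
print(round(ndcg_at_k(predicted_order, k=5), 3))   # ~0.9: close to 1 is good
```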
Figure 4: Kernel Density Estimate of NDCG@100 scores for Strings outputs from the holdout dataset. Scores are calculated for the original ordering after simply running the Strings program on each binary (gray) versus the predicted ordering from the trained GBDT model (red).
In practice, we take the first k strings indexed by their ranks within a single Strings output, where the k parameter is chosen based on how many strings a malware analyst will attend to or deem relevant on average. For our purposes we set k = 100 based on the approximate average number of relevant strings per Strings output. NDCG@k scores are bounded between 0 and 1, with scores closer to 1 indicating better prediction quality in which more relevant strings surface towards the top. This measurement allows us to evaluate the predictions from a given model versus those generated by other models and ranked with different algorithms.
To quantitatively assess model performance, we run the strings from each sample that has a ground truth FLARE report through the predict() step of Figure 2, and compare their predicted ranks with a baseline: the original ordering of strings output by Strings. The divergence in the distributions of NDCG@100 scores between these two approaches demonstrates that the trained GBDT model learns a useful structure that generalizes well to the independent holdout set (Figure 4).
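As a sketch of this comparison, scikit-learn’s ndcg_score can evaluate both orderings per sample: the baseline scores simply reproduce the file’s original order, while the model scores come from predict(). The labels and scores below are invented for illustration.

```python
# Holdout evaluation sketch: NDCG@100 of the original Strings ordering versus
# the model's ordering for one sample. All numbers are invented; in practice
# this is repeated per holdout sample to obtain the distributions in Figure 4.
import numpy as np
from sklearn.metrics import ndcg_score   # expects 2D (n_queries, n_items) arrays

true_relevance = np.array([[0, 7, 0, 5, 1, 0, 6, 0]])    # labels in file order
model_scores   = np.array([[-1.2, 4.8, 0.1, 3.3, 0.9, -0.5, 4.1, 0.0]])
n = true_relevance.shape[1]
baseline_scores = np.arange(n, 0, -1).reshape(1, -1)     # preserves original order

print("baseline NDCG@100:", ndcg_score(true_relevance, baseline_scores, k=100))
print("model    NDCG@100:", ndcg_score(true_relevance, model_scores, k=100))
```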
Conclusion
In this blog post, we introduced an ML model that learns to rank strings based on their relevance for malware analysis. Our results illustrate that it can rank Strings output well based both on qualitative inspection (Figure 3) and quantitative evaluation of NDCG@k (Figure 4). Since Strings is so commonly applied during malware analysis at FireEye and elsewhere, this model could significantly reduce the overall time required to investigate suspected malicious binaries at scale. We plan to continue improving its NDCG@k scores by training it with more high-fidelity labeled data, incorporating more sophisticated modeling and featurization techniques, and soliciting further analyst feedback from field testing.
It’s well known that malware authors go to great lengths to conceal useful strings from analysts, and a potential blind spot to consider for this model is that the utility of Strings itself can be thwarted by obfuscation. However, open source tools like the FireEye Labs Obfuscated String Solver (FLOSS) can be used as a drop-in replacement for Strings. FLOSS automatically extracts printable strings just as Strings does, but additionally reveals obfuscated strings that have been encoded, packed, or manually constructed on the stack. The model can be readily trained on FLOSS outputs to rank even obfuscated strings. Furthermore, since it can be applied to arbitrary lists of strings, the model could also be used to rank strings extracted from live memory dumps and sandbox runs.
This work represents a collaboration between the FDS and FLARE teams, which together build predictive models to help find evil and improve outcomes for FireEye’s customers and products. If you are interested in this mission, please consider joining the team by applying to one of our job openings.
Figure 1: An example Strings output containing 44 strings for a toy sample with a SHA-256 value of eb84360ca4e33b8bb60df47ab5ce962501ef3420bc7aab90655fd507d2ffcedd.
Ranking strings in terms of descending relevance would make an analyst’s life much easier. They would then only need to focus their attention on the most relevant strings located towards the top of the list, and simply disregard everything below. However, solving the task of automatically ranking strings is not trivial. The space of relevant strings is unstructured and vast, and devising finely tuned rules to robustly account for all the possible variations among them would be a tall order.
Learning to Rank Strings Output
This task can instead be formulated in a machine learning (ML) framework called learning to rank (LTR), which has been historically applied to problems like information retrieval, machine translation, web search, and collaborative filtering. One way to tackle LTR problems is by using Gradient Boosted Decision Trees (GBDTs). GBDTs successively learn individual decision trees that reduce the loss using a gradient descent procedure, and ultimately use a weighted sum of every trees’ prediction as an ensemble. GBDTs with an LTR objective function can learn class probabilities to compute each string’s expected relevance, which can then be used to rank a given Strings output. We provide a high-level overview of how this works in Figure 2.
In the initial train() step of Figure 2, over 25 thousand binaries are run through the Strings program to generate training data consisting of over 18 million total strings. Each training sample then corresponds to the concatenated list of ASCII and Unicode strings output by the Strings program on that input file. To train the model, these raw strings are transformed into numerical vectors containing natural language processing features like Shannon entropy and character co-occurrence frequencies, together with domain-specific signals like the presence of indicators of compromise (e.g. file paths, IP addresses, URLs, etc.), format strings, imports, and other relevant landmarks.
Figure 2: The ML-based LTR framework ranks strings based on their relevance for malware analysis. This figure illustrates different steps of the machine learning modeling process: the initial train() step is denoted by solid arrows and boxes, and the subsequent predict() and sort() steps are denoted by dotted arrows and boxes.
Each transformed string’s feature vector is associated with a non-negative integer label that represents their relevance for malware analysis. Labels range from 0 to 7, with higher numbers indicating increased relevance. To generate these labels, we leverage the subject matter knowledge of FLARE analysts to apply heuristics and impose high-level constraints on the resulting label distributions. While this weak supervision approach may generate noise and spurious errors compared to an ideal case where every string is manually labeled, it also provides an inexpensive and model-agnostic way to integrate domain expertise directly into our GBDT model.
Next during the predict() step of Figure 2, we use the trained GBDT model to predict ranks for the strings belonging to an input file that was not originally part of the training data, and in this example query we use the Strings output shown in Figure 1. The model predicts ranks for each string in the query as floating-point numbers that represent expected relevance scores, and in the final sort() step of Figure 2, strings are sorted in descending order by these scores. Figure 3 illustrates how this resulting prediction achieves the desired goal of ranking strings according to their relevance for malware analysis.
Figure 3: The resulting ranking on the strings depicted in both Figure 1 and in the truncated query of Figure 2. Contrast the relative ordering of the strings shown here to those otherwise identical lists.
The predicted and sorted string rankings in Figure 3 show network-based indicators on top of the list, followed by registry paths and entries. These reveal the potential C2 server and malicious behavior on the host. The subsequent output consisting of user-related information is more likely to be benign, but still worthy of investigation. Rounding out the list are common strings like Windows API functions and PE artifacts that tend to raise no red flags for the malware analyst.
Quantitative Evaluation
While it seems like the model qualitatively ranks the above strings as expected, we would like some quantitative way to assess the model’s performance more holistically. What evaluation criteria can we use to convince ourselves that the model generalizes beyond the coverage of our weak supervision sources, and to compare models that are trained with different parameters?
We turn to the recommender systems literature, which uses the Normalized Discounted Cumulative Gain (NDCG) score to evaluate ranking of items (i.e. individual strings) in a collection (i.e. a Strings output). NDCG sounds complicated, but let’s boil it down one letter at a time:
“G” is for gain, which corresponds to the magnitude of each string’s relevance.
“C” is for cumulative, which refers to the cumulative gain or summed total of every string’s relevance.
“D” is for discounted, which divides each string’s predicted relevance by a monotonically increasing function like the logarithm of its ranked position, reflecting the goal of having the most relevant strings ranked towards the top of our predictions.
“N” is for normalized, which means dividing DCG scores by ideal DCG scores calculated for a ground truth holdout dataset, which we obtain from FLARE-identified relevant strings contained within historical malware reports. Normalization makes it possible to compare scores across samples since the number of strings within different Strings outputs can vary widely.
Figure 4: Kernel Density Estimate of NDCG@100 scores for Strings outputs from the holdout dataset. Scores are calculated for the original ordering after simply running the Strings program on each binary (gray) versus the predicted ordering from the trained GBDT model (red).
In practice, we take the first k strings indexed by their ranks within a single Strings output, where the k parameter is chosen based on how many strings a malware analyst will attend to or deem relevant on average. For our purposes we set k = 100 based on the approximate average number of relevant strings per Strings output. NDCG@k scores are bounded between 0 and 1, with scores closer to 1 indicating better prediction quality in which more relevant strings surface towards the top. This measurement allows us to evaluate the predictions from a given model versus those generated by other models and ranked with different algorithms.
To quantitatively assess model performance, we run the strings from each sample that have ground truth FLARE reports though the predict() step of Figure 2, and compare their predicted ranks with a baseline of the original ranking of strings output by Strings. The divergence in distributions of NDCG@100 scores between these two approaches demonstrates that the trained GBDT model learns a useful structure that generalizes well to the independent holdout set (Figure 4).
Conclusion
In this blog post, we introduced an ML model that learns to rank strings based on their relevance for malware analysis. Our results illustrate that it can rank Strings output based both on qualitative inspection (Figure 3) and quantitative evaluation of NDCG@k (Figure 4). Since Strings is so commonly applied during malware analysis at FireEye and elsewhere, this model could significantly reduce the overall time required to investigate suspected malicious binaries at scale. We plan on continuing to improve its NDCG@k scores by training it with more high fidelity labeled data, incorporating more sophisticated modeling and featurization techniques, and soliciting further analyst feedback from field testing.
It’s well known that malware authors go through great lengths to conceal useful strings from analysts, and a potential blind spot to consider for this model is that the utility of Strings itself can be thwarted by obfuscation. However, open source tools like the FireEye Labs Obfuscated Strings Solver (FLOSS) can be used as an in-line replacement for Strings. FLOSS automatically extracts printable strings just as Strings does, but additionally reveals obfuscated strings that have been encoded, packed, or manually constructed on the stack. The model can be readily trained on FLOSS outputs to rank even obfuscated strings. Furthermore, since it can be applied to arbitrary lists of strings, the model could also be used to rank strings extracted from live memory dumps and sandbox runs.
This work represents a collaboration between the FDS and FLARE teams, which together build predictive models to help find evil and improve outcomes for FireEye’s customers and products. If you are interested in this mission, please consider joining the team by applying to one of our job openings.
#gallery-0-5 { margin: auto; } #gallery-0-5 .gallery-item { float: left; margin-top: 10px; text-align: center; width: 33%; } #gallery-0-5 img { border: 2px solid #cfcfcf; } #gallery-0-5 .gallery-caption { margin-left: 0; } /* see gallery_shortcode() in wp-includes/media.php */
Go to Source Author: Philip Tully Learning to Rank Strings Output for Speedier Malware Analysis Original Post from FireEye Author: Philip Tully Reverse engineers, forensic investigators, and incident responders have an arsenal of tools at their disposal to dissect malicious…
0 notes
terabitweb · 6 years ago
Text
Original Post from FireEye Author: Philip Tully
Reverse engineers, forensic investigators, and incident responders have an arsenal of tools at their disposal to dissect malicious software binaries. When performing malware analysis, they successively apply these tools in order to gradually gather clues about a binary’s function, design detection methods, and ascertain how to contain its damage. One of the most useful initial steps is to inspect its printable characters via the Strings program. A binary will often contain strings if it performs operations like printing an error message, connecting to a URL, creating a registry key, or copying a file to a specific location – each of which provide crucial hints that can help drive future analysis.
Manually filtering out these relevant strings can be time consuming and error prone, especially considering that:
Relevant strings occur disproportionately less often than irrelevant strings.
Larger binaries can output upwards of tens of thousands of individual strings.
The definition of “relevant” can vary significantly across individual human analysts.
Investigators would never want to miss an important clue that could have reduced their time spent performing the malware analysis, or even worse, led them to draw incomplete or incorrect conclusions. In this blog post, we will demonstrate how the FireEye Data Science (FDS) and FireEye Labs Reverse Engineering (FLARE) teams recently collaborated to streamline this analyst pain point using machine learning.
Highlights
Running the Strings program on a piece of malware inevitably produces noisy strings mixed in with important ones, which can only be uncovered after sifting and scrolling through the entirety of its messy output. FireEye’s new machine learning model that automatically ranks strings based on their relevance for malware analysis speeds up this process at scale.
Knowing which individual strings are relevant often requires highly experienced analysts. Quality, security-relevant labeled training data can be time consuming and expensive to obtain, but weak supervision that leverages the domain expertise of reverse engineers helps accelerate this bottleneck.
Our proposed learning-to-rank model can efficiently prioritize Strings outputs from individual malware samples. On a dataset of relevant strings from over 7 years of malware reports authored by FireEye reverse engineers, it also performs well based on criteria commonly used to evaluate recommendation and search engines.
Background
Each string returned by the Strings program is represented by sequences of 3 characters or more ending with a null terminator, independent of any surrounding context and file formatting. These loose criteria mean that Strings may identify sequences of characters as strings when they are not human-interpretable. For example, if consecutive bytes 0x31, 0x33, 0x33, 0x37, 0x00 appear within a binary, Strings will interpret this as “1337.” However, those ASCII characters may not actually represent that string per se; they could instead represent a memory address, CPU instructions, or even data utilized by the program. Strings leaves it up to the analyst to filter out such irrelevant strings that appear within its output. For instance, only a handful of the strings listed in Figure 1 that originate from an example malicious binary are relevant from a malware analyst’s point of view.
Figure 1: An example Strings output containing 44 strings for a toy sample with a SHA-256 value of eb84360ca4e33b8bb60df47ab5ce962501ef3420bc7aab90655fd507d2ffcedd.
Ranking strings in terms of descending relevance would make an analyst’s life much easier. They would then only need to focus their attention on the most relevant strings located towards the top of the list, and simply disregard everything below. However, solving the task of automatically ranking strings is not trivial. The space of relevant strings is unstructured and vast, and devising finely tuned rules to robustly account for all the possible variations among them would be a tall order.
Learning to Rank Strings Output
This task can instead be formulated in a machine learning (ML) framework called learning to rank (LTR), which has been historically applied to problems like information retrieval, machine translation, web search, and collaborative filtering. One way to tackle LTR problems is by using Gradient Boosted Decision Trees (GBDTs). GBDTs successively learn individual decision trees that reduce the loss using a gradient descent procedure, and ultimately use a weighted sum of every trees’ prediction as an ensemble. GBDTs with an LTR objective function can learn class probabilities to compute each string’s expected relevance, which can then be used to rank a given Strings output. We provide a high-level overview of how this works in Figure 2.
In the initial train() step of Figure 2, over 25 thousand binaries are run through the Strings program to generate training data consisting of over 18 million total strings. Each training sample then corresponds to the concatenated list of ASCII and Unicode strings output by the Strings program on that input file. To train the model, these raw strings are transformed into numerical vectors containing natural language processing features like Shannon entropy and character co-occurrence frequencies, together with domain-specific signals like the presence of indicators of compromise (e.g. file paths, IP addresses, URLs, etc.), format strings, imports, and other relevant landmarks.
Figure 2: The ML-based LTR framework ranks strings based on their relevance for malware analysis. This figure illustrates different steps of the machine learning modeling process: the initial train() step is denoted by solid arrows and boxes, and the subsequent predict() and sort() steps are denoted by dotted arrows and boxes.
Each transformed string’s feature vector is associated with a non-negative integer label that represents their relevance for malware analysis. Labels range from 0 to 7, with higher numbers indicating increased relevance. To generate these labels, we leverage the subject matter knowledge of FLARE analysts to apply heuristics and impose high-level constraints on the resulting label distributions. While this weak supervision approach may generate noise and spurious errors compared to an ideal case where every string is manually labeled, it also provides an inexpensive and model-agnostic way to integrate domain expertise directly into our GBDT model.
Next during the predict() step of Figure 2, we use the trained GBDT model to predict ranks for the strings belonging to an input file that was not originally part of the training data, and in this example query we use the Strings output shown in Figure 1. The model predicts ranks for each string in the query as floating-point numbers that represent expected relevance scores, and in the final sort() step of Figure 2, strings are sorted in descending order by these scores. Figure 3 illustrates how this resulting prediction achieves the desired goal of ranking strings according to their relevance for malware analysis.
Figure 3: The resulting ranking on the strings depicted in both Figure 1 and in the truncated query of Figure 2. Contrast the relative ordering of the strings shown here to those otherwise identical lists.
The predicted and sorted string rankings in Figure 3 show network-based indicators on top of the list, followed by registry paths and entries. These reveal the potential C2 server and malicious behavior on the host. The subsequent output consisting of user-related information is more likely to be benign, but still worthy of investigation. Rounding out the list are common strings like Windows API functions and PE artifacts that tend to raise no red flags for the malware analyst.
Quantitative Evaluation
While it seems like the model qualitatively ranks the above strings as expected, we would like some quantitative way to assess the model’s performance more holistically. What evaluation criteria can we use to convince ourselves that the model generalizes beyond the coverage of our weak supervision sources, and to compare models that are trained with different parameters?
We turn to the recommender systems literature, which uses the Normalized Discounted Cumulative Gain (NDCG) score to evaluate ranking of items (i.e. individual strings) in a collection (i.e. a Strings output). NDCG sounds complicated, but let’s boil it down one letter at a time:
“G” is for gain, which corresponds to the magnitude of each string’s relevance.
“C” is for cumulative, which refers to the cumulative gain or summed total of every string’s relevance.
“D” is for discounted, which divides each string’s predicted relevance by a monotonically increasing function like the logarithm of its ranked position, reflecting the goal of having the most relevant strings ranked towards the top of our predictions.
“N” is for normalized, which means dividing DCG scores by ideal DCG scores calculated for a ground truth holdout dataset, which we obtain from FLARE-identified relevant strings contained within historical malware reports. Normalization makes it possible to compare scores across samples since the number of strings within different Strings outputs can vary widely.
Figure 4: Kernel Density Estimate of NDCG@100 scores for Strings outputs from the holdout dataset. Scores are calculated for the original ordering after simply running the Strings program on each binary (gray) versus the predicted ordering from the trained GBDT model (red).
In practice, we take the first k strings indexed by their ranks within a single Strings output, where the k parameter is chosen based on how many strings a malware analyst will attend to or deem relevant on average. For our purposes we set k = 100 based on the approximate average number of relevant strings per Strings output. NDCG@k scores are bounded between 0 and 1, with scores closer to 1 indicating better prediction quality in which more relevant strings surface towards the top. This measurement allows us to evaluate the predictions from a given model versus those generated by other models and ranked with different algorithms.
To quantitatively assess model performance, we run the strings from each sample that have ground truth FLARE reports though the predict() step of Figure 2, and compare their predicted ranks with a baseline of the original ranking of strings output by Strings. The divergence in distributions of NDCG@100 scores between these two approaches demonstrates that the trained GBDT model learns a useful structure that generalizes well to the independent holdout set (Figure 4).
Conclusion
In this blog post, we introduced an ML model that learns to rank strings based on their relevance for malware analysis. Our results illustrate that it can rank Strings output based both on qualitative inspection (Figure 3) and quantitative evaluation of NDCG@k (Figure 4). Since Strings is so commonly applied during malware analysis at FireEye and elsewhere, this model could significantly reduce the overall time required to investigate suspected malicious binaries at scale. We plan on continuing to improve its NDCG@k scores by training it with more high fidelity labeled data, incorporating more sophisticated modeling and featurization techniques, and soliciting further analyst feedback from field testing.
It’s well known that malware authors go through great lengths to conceal useful strings from analysts, and a potential blind spot to consider for this model is that the utility of Strings itself can be thwarted by obfuscation. However, open source tools like the FireEye Labs Obfuscated Strings Solver (FLOSS) can be used as an in-line replacement for Strings. FLOSS automatically extracts printable strings just as Strings does, but additionally reveals obfuscated strings that have been encoded, packed, or manually constructed on the stack. The model can be readily trained on FLOSS outputs to rank even obfuscated strings. Furthermore, since it can be applied to arbitrary lists of strings, the model could also be used to rank strings extracted from live memory dumps and sandbox runs.
This work represents a collaboration between the FDS and FLARE teams, which together build predictive models to help find evil and improve outcomes for FireEye’s customers and products. If you are interested in this mission, please consider joining the team by applying to one of our job openings.
#gallery-0-5 { margin: auto; } #gallery-0-5 .gallery-item { float: left; margin-top: 10px; text-align: center; width: 33%; } #gallery-0-5 img { border: 2px solid #cfcfcf; } #gallery-0-5 .gallery-caption { margin-left: 0; } /* see gallery_shortcode() in wp-includes/media.php */
Go to Source Author: Philip Tully Learning to Rank Strings Output for Speedier Malware Analysis Original Post from FireEye Author: Philip Tully Reverse engineers, forensic investigators, and incident responders have an arsenal of tools at their disposal to dissect malicious…
0 notes
terabitweb · 6 years ago
Text
Original Post from FireEye Author: Philip Tully
Reverse engineers, forensic investigators, and incident responders have an arsenal of tools at their disposal to dissect malicious software binaries. When performing malware analysis, they successively apply these tools in order to gradually gather clues about a binary’s function, design detection methods, and ascertain how to contain its damage. One of the most useful initial steps is to inspect its printable characters via the Strings program. A binary will often contain strings if it performs operations like printing an error message, connecting to a URL, creating a registry key, or copying a file to a specific location – each of which provide crucial hints that can help drive future analysis.
Manually filtering out these relevant strings can be time consuming and error prone, especially considering that:
Relevant strings occur disproportionately less often than irrelevant strings.
Larger binaries can output upwards of tens of thousands of individual strings.
The definition of “relevant” can vary significantly across individual human analysts.
Investigators would never want to miss an important clue that could have reduced their time spent performing the malware analysis, or even worse, led them to draw incomplete or incorrect conclusions. In this blog post, we will demonstrate how the FireEye Data Science (FDS) and FireEye Labs Reverse Engineering (FLARE) teams recently collaborated to streamline this analyst pain point using machine learning.
Highlights
Running the Strings program on a piece of malware inevitably produces noisy strings mixed in with important ones, which can only be uncovered after sifting and scrolling through the entirety of its messy output. FireEye’s new machine learning model that automatically ranks strings based on their relevance for malware analysis speeds up this process at scale.
Knowing which individual strings are relevant often requires highly experienced analysts. Quality, security-relevant labeled training data can be time consuming and expensive to obtain, but weak supervision that leverages the domain expertise of reverse engineers helps accelerate this bottleneck.
Our proposed learning-to-rank model can efficiently prioritize Strings outputs from individual malware samples. On a dataset of relevant strings from over 7 years of malware reports authored by FireEye reverse engineers, it also performs well based on criteria commonly used to evaluate recommendation and search engines.
Background
Each string returned by the Strings program is represented by sequences of 3 characters or more ending with a null terminator, independent of any surrounding context and file formatting. These loose criteria mean that Strings may identify sequences of characters as strings when they are not human-interpretable. For example, if consecutive bytes 0x31, 0x33, 0x33, 0x37, 0x00 appear within a binary, Strings will interpret this as “1337.” However, those ASCII characters may not actually represent that string per se; they could instead represent a memory address, CPU instructions, or even data utilized by the program. Strings leaves it up to the analyst to filter out such irrelevant strings that appear within its output. For instance, only a handful of the strings listed in Figure 1 that originate from an example malicious binary are relevant from a malware analyst’s point of view.
Figure 1: An example Strings output containing 44 strings for a toy sample with a SHA-256 value of eb84360ca4e33b8bb60df47ab5ce962501ef3420bc7aab90655fd507d2ffcedd.
Ranking strings in terms of descending relevance would make an analyst’s life much easier. They would then only need to focus their attention on the most relevant strings located towards the top of the list, and simply disregard everything below. However, solving the task of automatically ranking strings is not trivial. The space of relevant strings is unstructured and vast, and devising finely tuned rules to robustly account for all the possible variations among them would be a tall order.
Learning to Rank Strings Output
This task can instead be formulated in a machine learning (ML) framework called learning to rank (LTR), which has been historically applied to problems like information retrieval, machine translation, web search, and collaborative filtering. One way to tackle LTR problems is by using Gradient Boosted Decision Trees (GBDTs). GBDTs successively learn individual decision trees that reduce the loss using a gradient descent procedure, and ultimately use a weighted sum of every trees’ prediction as an ensemble. GBDTs with an LTR objective function can learn class probabilities to compute each string’s expected relevance, which can then be used to rank a given Strings output. We provide a high-level overview of how this works in Figure 2.
In the initial train() step of Figure 2, over 25 thousand binaries are run through the Strings program to generate training data consisting of over 18 million total strings. Each training sample then corresponds to the concatenated list of ASCII and Unicode strings output by the Strings program on that input file. To train the model, these raw strings are transformed into numerical vectors containing natural language processing features like Shannon entropy and character co-occurrence frequencies, together with domain-specific signals like the presence of indicators of compromise (e.g. file paths, IP addresses, URLs, etc.), format strings, imports, and other relevant landmarks.
Figure 2: The ML-based LTR framework ranks strings based on their relevance for malware analysis. This figure illustrates different steps of the machine learning modeling process: the initial train() step is denoted by solid arrows and boxes, and the subsequent predict() and sort() steps are denoted by dotted arrows and boxes.
Each transformed string’s feature vector is associated with a non-negative integer label that represents their relevance for malware analysis. Labels range from 0 to 7, with higher numbers indicating increased relevance. To generate these labels, we leverage the subject matter knowledge of FLARE analysts to apply heuristics and impose high-level constraints on the resulting label distributions. While this weak supervision approach may generate noise and spurious errors compared to an ideal case where every string is manually labeled, it also provides an inexpensive and model-agnostic way to integrate domain expertise directly into our GBDT model.
Next during the predict() step of Figure 2, we use the trained GBDT model to predict ranks for the strings belonging to an input file that was not originally part of the training data, and in this example query we use the Strings output shown in Figure 1. The model predicts ranks for each string in the query as floating-point numbers that represent expected relevance scores, and in the final sort() step of Figure 2, strings are sorted in descending order by these scores. Figure 3 illustrates how this resulting prediction achieves the desired goal of ranking strings according to their relevance for malware analysis.
Figure 3: The resulting ranking on the strings depicted in both Figure 1 and in the truncated query of Figure 2. Contrast the relative ordering of the strings shown here to those otherwise identical lists.
The predicted and sorted string rankings in Figure 3 show network-based indicators on top of the list, followed by registry paths and entries. These reveal the potential C2 server and malicious behavior on the host. The subsequent output consisting of user-related information is more likely to be benign, but still worthy of investigation. Rounding out the list are common strings like Windows API functions and PE artifacts that tend to raise no red flags for the malware analyst.
Quantitative Evaluation
While it seems like the model qualitatively ranks the above strings as expected, we would like some quantitative way to assess the model’s performance more holistically. What evaluation criteria can we use to convince ourselves that the model generalizes beyond the coverage of our weak supervision sources, and to compare models that are trained with different parameters?
We turn to the recommender systems literature, which uses the Normalized Discounted Cumulative Gain (NDCG) score to evaluate ranking of items (i.e. individual strings) in a collection (i.e. a Strings output). NDCG sounds complicated, but let’s boil it down one letter at a time:
“G” is for gain, which corresponds to the magnitude of each string’s relevance.
“C” is for cumulative, which refers to the cumulative gain or summed total of every string’s relevance.
“D” is for discounted, which divides each string’s predicted relevance by a monotonically increasing function like the logarithm of its ranked position, reflecting the goal of having the most relevant strings ranked towards the top of our predictions.
“N” is for normalized, which means dividing DCG scores by ideal DCG scores calculated for a ground truth holdout dataset, which we obtain from FLARE-identified relevant strings contained within historical malware reports. Normalization makes it possible to compare scores across samples since the number of strings within different Strings outputs can vary widely.
Figure 4: Kernel Density Estimate of NDCG@100 scores for Strings outputs from the holdout dataset. Scores are calculated for the original ordering after simply running the Strings program on each binary (gray) versus the predicted ordering from the trained GBDT model (red).
In practice, we take the first k strings indexed by their ranks within a single Strings output, where the k parameter is chosen based on how many strings a malware analyst will attend to or deem relevant on average. For our purposes we set k = 100 based on the approximate average number of relevant strings per Strings output. NDCG@k scores are bounded between 0 and 1, with scores closer to 1 indicating better prediction quality in which more relevant strings surface towards the top. This measurement allows us to evaluate the predictions from a given model versus those generated by other models and ranked with different algorithms.
To quantitatively assess model performance, we run the strings from each sample that have ground truth FLARE reports though the predict() step of Figure 2, and compare their predicted ranks with a baseline of the original ranking of strings output by Strings. The divergence in distributions of NDCG@100 scores between these two approaches demonstrates that the trained GBDT model learns a useful structure that generalizes well to the independent holdout set (Figure 4).
Conclusion
In this blog post, we introduced an ML model that learns to rank strings based on their relevance for malware analysis. Our results illustrate that it can rank Strings output based both on qualitative inspection (Figure 3) and quantitative evaluation of NDCG@k (Figure 4). Since Strings is so commonly applied during malware analysis at FireEye and elsewhere, this model could significantly reduce the overall time required to investigate suspected malicious binaries at scale. We plan on continuing to improve its NDCG@k scores by training it with more high fidelity labeled data, incorporating more sophisticated modeling and featurization techniques, and soliciting further analyst feedback from field testing.
It’s well known that malware authors go through great lengths to conceal useful strings from analysts, and a potential blind spot to consider for this model is that the utility of Strings itself can be thwarted by obfuscation. However, open source tools like the FireEye Labs Obfuscated Strings Solver (FLOSS) can be used as an in-line replacement for Strings. FLOSS automatically extracts printable strings just as Strings does, but additionally reveals obfuscated strings that have been encoded, packed, or manually constructed on the stack. The model can be readily trained on FLOSS outputs to rank even obfuscated strings. Furthermore, since it can be applied to arbitrary lists of strings, the model could also be used to rank strings extracted from live memory dumps and sandbox runs.
This work represents a collaboration between the FDS and FLARE teams, which together build predictive models to help find evil and improve outcomes for FireEye’s customers and products. If you are interested in this mission, please consider joining the team by applying to one of our job openings.
#gallery-0-5 { margin: auto; } #gallery-0-5 .gallery-item { float: left; margin-top: 10px; text-align: center; width: 33%; } #gallery-0-5 img { border: 2px solid #cfcfcf; } #gallery-0-5 .gallery-caption { margin-left: 0; } /* see gallery_shortcode() in wp-includes/media.php */
Go to Source Author: Philip Tully Learning to Rank Strings Output for Speedier Malware Analysis Original Post from FireEye Author: Philip Tully Reverse engineers, forensic investigators, and incident responders have an arsenal of tools at their disposal to dissect malicious…
0 notes
terabitweb · 6 years ago
Text
Original Post from FireEye Author: Philip Tully
Reverse engineers, forensic investigators, and incident responders have an arsenal of tools at their disposal to dissect malicious software binaries. When performing malware analysis, they successively apply these tools in order to gradually gather clues about a binary’s function, design detection methods, and ascertain how to contain its damage. One of the most useful initial steps is to inspect its printable characters via the Strings program. A binary will often contain strings if it performs operations like printing an error message, connecting to a URL, creating a registry key, or copying a file to a specific location – each of which provide crucial hints that can help drive future analysis.
Manually filtering out these relevant strings can be time consuming and error prone, especially considering that:
Relevant strings occur disproportionately less often than irrelevant strings.
Larger binaries can output upwards of tens of thousands of individual strings.
The definition of “relevant” can vary significantly across individual human analysts.
Investigators would never want to miss an important clue that could have reduced their time spent performing the malware analysis, or even worse, led them to draw incomplete or incorrect conclusions. In this blog post, we will demonstrate how the FireEye Data Science (FDS) and FireEye Labs Reverse Engineering (FLARE) teams recently collaborated to streamline this analyst pain point using machine learning.
Highlights
Running the Strings program on a piece of malware inevitably produces noisy strings mixed in with important ones, which can only be uncovered after sifting and scrolling through the entirety of its messy output. FireEye’s new machine learning model that automatically ranks strings based on their relevance for malware analysis speeds up this process at scale.
Knowing which individual strings are relevant often requires highly experienced analysts. Quality, security-relevant labeled training data can be time consuming and expensive to obtain, but weak supervision that leverages the domain expertise of reverse engineers helps accelerate this bottleneck.
Our proposed learning-to-rank model can efficiently prioritize Strings outputs from individual malware samples. On a dataset of relevant strings from over 7 years of malware reports authored by FireEye reverse engineers, it also performs well based on criteria commonly used to evaluate recommendation and search engines.
Background
Each string returned by the Strings program is represented by sequences of 3 characters or more ending with a null terminator, independent of any surrounding context and file formatting. These loose criteria mean that Strings may identify sequences of characters as strings when they are not human-interpretable. For example, if consecutive bytes 0x31, 0x33, 0x33, 0x37, 0x00 appear within a binary, Strings will interpret this as “1337.” However, those ASCII characters may not actually represent that string per se; they could instead represent a memory address, CPU instructions, or even data utilized by the program. Strings leaves it up to the analyst to filter out such irrelevant strings that appear within its output. For instance, only a handful of the strings listed in Figure 1 that originate from an example malicious binary are relevant from a malware analyst’s point of view.
Figure 1: An example Strings output containing 44 strings for a toy sample with a SHA-256 value of eb84360ca4e33b8bb60df47ab5ce962501ef3420bc7aab90655fd507d2ffcedd.
Ranking strings in terms of descending relevance would make an analyst’s life much easier. They would then only need to focus their attention on the most relevant strings located towards the top of the list, and simply disregard everything below. However, solving the task of automatically ranking strings is not trivial. The space of relevant strings is unstructured and vast, and devising finely tuned rules to robustly account for all the possible variations among them would be a tall order.
Learning to Rank Strings Output
This task can instead be formulated in a machine learning (ML) framework called learning to rank (LTR), which has been historically applied to problems like information retrieval, machine translation, web search, and collaborative filtering. One way to tackle LTR problems is by using Gradient Boosted Decision Trees (GBDTs). GBDTs successively learn individual decision trees that reduce the loss using a gradient descent procedure, and ultimately use a weighted sum of every trees’ prediction as an ensemble. GBDTs with an LTR objective function can learn class probabilities to compute each string’s expected relevance, which can then be used to rank a given Strings output. We provide a high-level overview of how this works in Figure 2.
In the initial train() step of Figure 2, over 25 thousand binaries are run through the Strings program to generate training data consisting of over 18 million total strings. Each training sample then corresponds to the concatenated list of ASCII and Unicode strings output by the Strings program on that input file. To train the model, these raw strings are transformed into numerical vectors containing natural language processing features like Shannon entropy and character co-occurrence frequencies, together with domain-specific signals like the presence of indicators of compromise (e.g. file paths, IP addresses, URLs, etc.), format strings, imports, and other relevant landmarks.
Figure 2: The ML-based LTR framework ranks strings based on their relevance for malware analysis. This figure illustrates different steps of the machine learning modeling process: the initial train() step is denoted by solid arrows and boxes, and the subsequent predict() and sort() steps are denoted by dotted arrows and boxes.
Each transformed string’s feature vector is associated with a non-negative integer label that represents their relevance for malware analysis. Labels range from 0 to 7, with higher numbers indicating increased relevance. To generate these labels, we leverage the subject matter knowledge of FLARE analysts to apply heuristics and impose high-level constraints on the resulting label distributions. While this weak supervision approach may generate noise and spurious errors compared to an ideal case where every string is manually labeled, it also provides an inexpensive and model-agnostic way to integrate domain expertise directly into our GBDT model.
Next during the predict() step of Figure 2, we use the trained GBDT model to predict ranks for the strings belonging to an input file that was not originally part of the training data, and in this example query we use the Strings output shown in Figure 1. The model predicts ranks for each string in the query as floating-point numbers that represent expected relevance scores, and in the final sort() step of Figure 2, strings are sorted in descending order by these scores. Figure 3 illustrates how this resulting prediction achieves the desired goal of ranking strings according to their relevance for malware analysis.
Figure 3: The resulting ranking on the strings depicted in both Figure 1 and in the truncated query of Figure 2. Contrast the relative ordering of the strings shown here to those otherwise identical lists.
The predicted and sorted string rankings in Figure 3 show network-based indicators on top of the list, followed by registry paths and entries. These reveal the potential C2 server and malicious behavior on the host. The subsequent output consisting of user-related information is more likely to be benign, but still worthy of investigation. Rounding out the list are common strings like Windows API functions and PE artifacts that tend to raise no red flags for the malware analyst.
Quantitative Evaluation
While it seems like the model qualitatively ranks the above strings as expected, we would like some quantitative way to assess the model’s performance more holistically. What evaluation criteria can we use to convince ourselves that the model generalizes beyond the coverage of our weak supervision sources, and to compare models that are trained with different parameters?
We turn to the recommender systems literature, which uses the Normalized Discounted Cumulative Gain (NDCG) score to evaluate ranking of items (i.e. individual strings) in a collection (i.e. a Strings output). NDCG sounds complicated, but let’s boil it down one letter at a time:
“G” is for gain, which corresponds to the magnitude of each string’s relevance.
“C” is for cumulative, which refers to the cumulative gain or summed total of every string’s relevance.
“D” is for discounted, which divides each string’s predicted relevance by a monotonically increasing function like the logarithm of its ranked position, reflecting the goal of having the most relevant strings ranked towards the top of our predictions.
“N” is for normalized, which means dividing DCG scores by ideal DCG scores calculated for a ground truth holdout dataset, which we obtain from FLARE-identified relevant strings contained within historical malware reports. Normalization makes it possible to compare scores across samples since the number of strings within different Strings outputs can vary widely.
Figure 4: Kernel Density Estimate of NDCG@100 scores for Strings outputs from the holdout dataset. Scores are calculated for the original ordering after simply running the Strings program on each binary (gray) versus the predicted ordering from the trained GBDT model (red).
In practice, we take the first k strings indexed by their ranks within a single Strings output, where the k parameter is chosen based on how many strings a malware analyst will attend to or deem relevant on average. For our purposes we set k = 100 based on the approximate average number of relevant strings per Strings output. NDCG@k scores are bounded between 0 and 1, with scores closer to 1 indicating better prediction quality in which more relevant strings surface towards the top. This measurement allows us to evaluate the predictions from a given model versus those generated by other models and ranked with different algorithms.
To quantitatively assess model performance, we run the strings from each sample that have ground truth FLARE reports though the predict() step of Figure 2, and compare their predicted ranks with a baseline of the original ranking of strings output by Strings. The divergence in distributions of NDCG@100 scores between these two approaches demonstrates that the trained GBDT model learns a useful structure that generalizes well to the independent holdout set (Figure 4).
Conclusion
In this blog post, we introduced an ML model that learns to rank strings based on their relevance for malware analysis. Our results illustrate that it can rank Strings output based both on qualitative inspection (Figure 3) and quantitative evaluation of NDCG@k (Figure 4). Since Strings is so commonly applied during malware analysis at FireEye and elsewhere, this model could significantly reduce the overall time required to investigate suspected malicious binaries at scale. We plan on continuing to improve its NDCG@k scores by training it with more high fidelity labeled data, incorporating more sophisticated modeling and featurization techniques, and soliciting further analyst feedback from field testing.
It’s well known that malware authors go through great lengths to conceal useful strings from analysts, and a potential blind spot to consider for this model is that the utility of Strings itself can be thwarted by obfuscation. However, open source tools like the FireEye Labs Obfuscated Strings Solver (FLOSS) can be used as an in-line replacement for Strings. FLOSS automatically extracts printable strings just as Strings does, but additionally reveals obfuscated strings that have been encoded, packed, or manually constructed on the stack. The model can be readily trained on FLOSS outputs to rank even obfuscated strings. Furthermore, since it can be applied to arbitrary lists of strings, the model could also be used to rank strings extracted from live memory dumps and sandbox runs.
This work represents a collaboration between the FDS and FLARE teams, which together build predictive models to help find evil and improve outcomes for FireEye’s customers and products. If you are interested in this mission, please consider joining the team by applying to one of our job openings.
#gallery-0-5 { margin: auto; } #gallery-0-5 .gallery-item { float: left; margin-top: 10px; text-align: center; width: 33%; } #gallery-0-5 img { border: 2px solid #cfcfcf; } #gallery-0-5 .gallery-caption { margin-left: 0; } /* see gallery_shortcode() in wp-includes/media.php */
Go to Source Author: Philip Tully Learning to Rank Strings Output for Speedier Malware Analysis Original Post from FireEye Author: Philip Tully Reverse engineers, forensic investigators, and incident responders have an arsenal of tools at their disposal to dissect malicious…
0 notes
terabitweb · 6 years ago
Text
Original Post from FireEye Author: Philip Tully
Reverse engineers, forensic investigators, and incident responders have an arsenal of tools at their disposal to dissect malicious software binaries. When performing malware analysis, they successively apply these tools in order to gradually gather clues about a binary’s function, design detection methods, and ascertain how to contain its damage. One of the most useful initial steps is to inspect its printable characters via the Strings program. A binary will often contain strings if it performs operations like printing an error message, connecting to a URL, creating a registry key, or copying a file to a specific location – each of which provide crucial hints that can help drive future analysis.
Manually filtering out these relevant strings can be time consuming and error prone, especially considering that:
Relevant strings occur disproportionately less often than irrelevant strings.
Larger binaries can output upwards of tens of thousands of individual strings.
The definition of “relevant” can vary significantly across individual human analysts.
Investigators would never want to miss an important clue that could have reduced their time spent performing the malware analysis, or even worse, led them to draw incomplete or incorrect conclusions. In this blog post, we will demonstrate how the FireEye Data Science (FDS) and FireEye Labs Reverse Engineering (FLARE) teams recently collaborated to streamline this analyst pain point using machine learning.
Highlights
Running the Strings program on a piece of malware inevitably produces noisy strings mixed in with important ones, which can only be uncovered after sifting and scrolling through the entirety of its messy output. FireEye’s new machine learning model that automatically ranks strings based on their relevance for malware analysis speeds up this process at scale.
Knowing which individual strings are relevant often requires highly experienced analysts. Quality, security-relevant labeled training data can be time consuming and expensive to obtain, but weak supervision that leverages the domain expertise of reverse engineers helps accelerate this bottleneck.
Our proposed learning-to-rank model can efficiently prioritize Strings outputs from individual malware samples. On a dataset of relevant strings from over 7 years of malware reports authored by FireEye reverse engineers, it also performs well based on criteria commonly used to evaluate recommendation and search engines.
Background
Each string returned by the Strings program is represented by sequences of 3 characters or more ending with a null terminator, independent of any surrounding context and file formatting. These loose criteria mean that Strings may identify sequences of characters as strings when they are not human-interpretable. For example, if consecutive bytes 0x31, 0x33, 0x33, 0x37, 0x00 appear within a binary, Strings will interpret this as “1337.” However, those ASCII characters may not actually represent that string per se; they could instead represent a memory address, CPU instructions, or even data utilized by the program. Strings leaves it up to the analyst to filter out such irrelevant strings that appear within its output. For instance, only a handful of the strings listed in Figure 1 that originate from an example malicious binary are relevant from a malware analyst’s point of view.
Figure 1: An example Strings output containing 44 strings for a toy sample with a SHA-256 value of eb84360ca4e33b8bb60df47ab5ce962501ef3420bc7aab90655fd507d2ffcedd.
Ranking strings in terms of descending relevance would make an analyst’s life much easier. They would then only need to focus their attention on the most relevant strings located towards the top of the list, and simply disregard everything below. However, solving the task of automatically ranking strings is not trivial. The space of relevant strings is unstructured and vast, and devising finely tuned rules to robustly account for all the possible variations among them would be a tall order.
Learning to Rank Strings Output
This task can instead be formulated in a machine learning (ML) framework called learning to rank (LTR), which has historically been applied to problems like information retrieval, machine translation, web search, and collaborative filtering. One way to tackle LTR problems is with Gradient Boosted Decision Trees (GBDTs). GBDTs successively learn individual decision trees that reduce the loss using a gradient descent procedure, and ultimately use a weighted sum of every tree’s prediction as an ensemble. GBDTs with an LTR objective function can learn class probabilities to compute each string’s expected relevance, which can then be used to rank a given Strings output. We provide a high-level overview of how this works in Figure 2.
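To make the setup concrete, here is a minimal sketch of training such a ranker with LightGBM’s LambdaRank objective on synthetic data. The post does not name a specific GBDT library, so the library choice, feature dimensionality, and group sizes are all assumptions for illustration:

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)

# Toy stand-in for the real training set: 3 binaries ("groups") whose Strings
# outputs contain 50, 80, and 120 strings, each string a 16-dim feature vector.
group_sizes = [50, 80, 120]
X = rng.normal(size=(sum(group_sizes), 16))
y = rng.integers(0, 8, size=sum(group_sizes))  # relevance labels 0..7

ranker = lgb.LGBMRanker(
    objective="lambdarank",  # LTR objective that optimizes an NDCG-like criterion
    n_estimators=100,
)
# `group` tells the ranker which rows belong to the same Strings output,
# so the loss only compares strings within the same binary.
ranker.fit(X, y, group=group_sizes)

scores = ranker.predict(X[:50])  # expected-relevance scores for one binary
```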
In the initial train() step of Figure 2, over 25 thousand binaries are run through the Strings program to generate training data consisting of over 18 million total strings. Each training sample then corresponds to the concatenated list of ASCII and Unicode strings output by the Strings program on that input file. To train the model, these raw strings are transformed into numerical vectors containing natural language processing features like Shannon entropy and character co-occurrence frequencies, together with domain-specific signals like the presence of indicators of compromise (e.g. file paths, IP addresses, URLs, etc.), format strings, imports, and other relevant landmarks.
Figure 2: The ML-based LTR framework ranks strings based on their relevance for malware analysis. This figure illustrates different steps of the machine learning modeling process: the initial train() step is denoted by solid arrows and boxes, and the subsequent predict() and sort() steps are denoted by dotted arrows and boxes.
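FireEye’s production featurization is not published, but a hedged sketch of the kinds of features the post names, string length, Shannon entropy, and a few indicator-of-compromise and format-string flags, might look like the following. The regexes and feature set here are illustrative assumptions, not the production features:

```python
import math
import re
from collections import Counter

IP_RE     = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
URL_RE    = re.compile(r"https?://", re.IGNORECASE)
PATH_RE   = re.compile(r"^(?:[A-Za-z]:\\|/)")  # Windows or POSIX path
FORMAT_RE = re.compile(r"%[sdxXufc]")          # printf-style format string

def shannon_entropy(s: str) -> float:
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def featurize(s: str) -> list:
    """Map one raw string to a numeric feature vector."""
    return [
        len(s),
        shannon_entropy(s),
        int(bool(IP_RE.search(s))),
        int(bool(URL_RE.search(s))),
        int(bool(PATH_RE.match(s))),
        int(bool(FORMAT_RE.search(s))),
    ]

print(featurize("http://203.0.113.7/gate.php"))
```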
Each transformed string’s feature vector is associated with a non-negative integer label that represents its relevance for malware analysis. Labels range from 0 to 7, with higher numbers indicating increased relevance. To generate these labels, we leverage the subject matter knowledge of FLARE analysts to apply heuristics and impose high-level constraints on the resulting label distributions. While this weak supervision approach may generate noise and spurious errors compared to an ideal case where every string is manually labeled, it also provides an inexpensive and model-agnostic way to integrate domain expertise directly into our GBDT model.
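The actual FLARE heuristics and label-distribution constraints are not public. The toy labeling function below only illustrates the shape of such weak supervision, with invented rules standing in for genuine analyst knowledge:

```python
import re

URL_RE = re.compile(r"https?://", re.IGNORECASE)
IP_RE  = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def weak_label(s: str) -> int:
    """Heuristic relevance label in 0..7; these rules are illustrative
    placeholders, not FLARE's real heuristics."""
    if URL_RE.search(s) or IP_RE.search(s):
        return 7    # network indicators rank highest
    if s.upper().startswith(("HKLM\\", "HKCU\\", "SOFTWARE\\")):
        return 6    # registry paths and entries
    if "\\" in s or s.startswith("/"):
        return 4    # file system paths
    if s.endswith((".dll", ".exe")):
        return 2    # imports and PE artifacts: common, rarely alarming
    return 0        # default: presumed irrelevant

print(weak_label("http://203.0.113.7/gate.php"))  # 7
```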
Next, during the predict() step of Figure 2, we use the trained GBDT model to score the strings belonging to an input file that was not originally part of the training data; in this example query we use the Strings output shown in Figure 1. The model predicts a floating-point expected relevance score for each string in the query, and in the final sort() step of Figure 2, strings are sorted in descending order by these scores. Figure 3 illustrates how the resulting prediction achieves the desired goal of ranking strings according to their relevance for malware analysis.
Figure 3: The resulting ranking of the strings depicted in both Figure 1 and in the truncated query of Figure 2. Contrast the relative ordering of the strings shown here with that of the otherwise identical lists in those figures.
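Mechanically, the final sort() step is just a descending sort keyed on the predicted scores. Here is a self-contained sketch with hypothetical strings and scores; in practice the scores would come from the trained ranker’s predict() call as above:

```python
import numpy as np

# Hypothetical query strings and their model-predicted relevance scores.
query_strings = [
    "!This program cannot be run in DOS mode.",
    "http://203.0.113.7/gate.php",
    "SOFTWARE\\Microsoft\\Windows\\CurrentVersion\\Run",
    "KERNEL32.dll",
]
scores = np.array([0.1, 6.9, 5.4, 0.8])

for i in np.argsort(-scores):  # descending expected relevance
    print(f"{scores[i]:5.2f}  {query_strings[i]}")
```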
The predicted and sorted string rankings in Figure 3 show network-based indicators on top of the list, followed by registry paths and entries. These reveal the potential C2 server and malicious behavior on the host. The subsequent output consisting of user-related information is more likely to be benign, but still worthy of investigation. Rounding out the list are common strings like Windows API functions and PE artifacts that tend to raise no red flags for the malware analyst.
Quantitative Evaluation
While it seems like the model qualitatively ranks the above strings as expected, we would like some quantitative way to assess the model’s performance more holistically. What evaluation criteria can we use to convince ourselves that the model generalizes beyond the coverage of our weak supervision sources, and to compare models that are trained with different parameters?
We turn to the recommender systems literature, which uses the Normalized Discounted Cumulative Gain (NDCG) score to evaluate the ranking of items (i.e. individual strings) in a collection (i.e. a Strings output). NDCG sounds complicated, but let’s boil it down one letter at a time (a short code sketch follows the list):
- “G” is for gain, which corresponds to the magnitude of each string’s relevance.
- “C” is for cumulative, which refers to the cumulative gain or summed total of every string’s relevance.
- “D” is for discounted, which divides each string’s predicted relevance by a monotonically increasing function like the logarithm of its ranked position, reflecting the goal of having the most relevant strings ranked towards the top of our predictions.
- “N” is for normalized, which means dividing DCG scores by ideal DCG scores calculated for a ground truth holdout dataset, which we obtain from FLARE-identified relevant strings contained within historical malware reports. Normalization makes it possible to compare scores across samples, since the number of strings within different Strings outputs can vary widely.
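Putting those four letters together, a common formulation from the literature (the exact gain and discount functions used in production are not stated in the post) computes NDCG@k as follows:

```python
import numpy as np

def dcg_at_k(relevances, k: int) -> float:
    """Discounted cumulative gain over the top-k ranked items."""
    rel = np.asarray(relevances, dtype=float)[:k]
    gains = 2.0 ** rel - 1.0                         # "G": gain from each label
    discounts = np.log2(np.arange(2, rel.size + 2))  # "D": log2(rank + 1)
    return float(np.sum(gains / discounts))          # "C": cumulative sum

def ndcg_at_k(relevances, k: int = 100) -> float:
    """"N": normalize by the DCG of an ideally sorted ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Labels (0..7) of strings in the order the model ranked them:
print(round(ndcg_at_k([7, 5, 0, 3, 0, 0], k=100), 3))
```

For anyone who would rather not hand-roll this, scikit-learn ships an equivalent implementation as sklearn.metrics.ndcg_score.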
Figure 4: Kernel Density Estimate of NDCG@100 scores for Strings outputs from the holdout dataset. Scores are calculated for the original ordering after simply running the Strings program on each binary (gray) versus the predicted ordering from the trained GBDT model (red).
In practice, we take the first k strings indexed by their ranks within a single Strings output, where the k parameter is chosen based on how many strings a malware analyst will attend to or deem relevant on average. For our purposes we set k = 100 based on the approximate average number of relevant strings per Strings output. NDCG@k scores are bounded between 0 and 1, with scores closer to 1 indicating better prediction quality in which more relevant strings surface towards the top. This measurement allows us to evaluate the predictions from a given model versus those generated by other models and ranked with different algorithms.
To quantitatively assess model performance, we run the strings from each sample that has a ground truth FLARE report through the predict() step of Figure 2, and compare their predicted ranks with a baseline: the original ordering of strings as output by Strings. The divergence between the distributions of NDCG@100 scores for these two approaches demonstrates that the trained GBDT model learns a useful structure that generalizes well to the independent holdout set (Figure 4).
Conclusion
In this blog post, we introduced an ML model that learns to rank strings based on their relevance for malware analysis. Our results illustrate that it ranks Strings output effectively, as judged both by qualitative inspection (Figure 3) and by quantitative evaluation with NDCG@k (Figure 4). Since Strings is so commonly applied during malware analysis at FireEye and elsewhere, this model could significantly reduce the overall time required to investigate suspected malicious binaries at scale. We plan on continuing to improve its NDCG@k scores by training with more high fidelity labeled data, incorporating more sophisticated modeling and featurization techniques, and soliciting further analyst feedback from field testing.
It’s well known that malware authors go to great lengths to conceal useful strings from analysts, so a potential blind spot to consider for this model is that the utility of Strings itself can be thwarted by obfuscation. However, open source tools like the FireEye Labs Obfuscated Strings Solver (FLOSS) can be used as an in-line replacement for Strings. FLOSS automatically extracts printable strings just as Strings does, but additionally reveals obfuscated strings that have been encoded, packed, or manually constructed on the stack. The model can be readily trained on FLOSS outputs to rank even obfuscated strings. Furthermore, since it can be applied to arbitrary lists of strings, the model could also be used to rank strings extracted from live memory dumps and sandbox runs.
This work represents a collaboration between the FDS and FLARE teams, which together build predictive models to help find evil and improve outcomes for FireEye’s customers and products. If you are interested in this mission, please consider joining the team by applying to one of our job openings.
Author: Philip Tully. Originally published on the FireEye blog as “Learning to Rank Strings Output for Speedier Malware Analysis.”