Welcome to the IKANOW Knowledge Discovery Blog. Here you will find our Technical and Analytic rants.
Link
Join us on April 5th to discuss the business value of unstructured data in analytics and the benefits of open source technologies in providing robust, big data solutions.
Link
We learned about an interesting open source player called Ikanow. One of my colleagues pronounced it, “I can know.” Sounded good to me. You can get information about the firm’s solutions for “agile intelligence.”
Link
Great post on the importance of open data for citizens and government!
Link
Allows the easy seeding of URLs from MongoDB into Nutch. This is similar in nature to the DmozParser that comes with Nutch, and provides a way to bootstrap and seed Nutch with data coming directly from MongoDB. The injector adds URLs from a specified MongoDB instance to the crawldb of your choice. - CM
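The idea can be sketched as a small script (field names here are hypothetical; the actual injector is a Nutch plugin): documents in MongoDB carrying a URL are flattened into the plain-text seed format that Nutch's injector expects, one URL per line.

```javascript
// Sketch: turn MongoDB documents into a Nutch seed list.
// Assumes each document has a `url` field; the field name is an assumption.
function toNutchSeeds(docs) {
  return docs
    .filter(function (doc) { return typeof doc.url === "string" && doc.url.length > 0; })
    .map(function (doc) { return doc.url; })
    .join("\n");
}

// In practice the documents would come from a MongoDB cursor;
// a static array stands in for it here.
var docs = [
  { url: "http://example.com/a" },
  { url: "http://example.com/b" },
  { notAUrl: true }
];
console.log(toNutchSeeds(docs));
```

The resulting seed file would then be handed to Nutch's standard inject step to populate the crawldb.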
Link
Allows direct indexing of Nutch crawl data into MongoDB. This is similar in nature to the SolrIndexer that comes with Nutch, which lets you index directly into Solr, and provides a way to index data coming directly from Nutch into MongoDB. - CM
Link
Allows the indexing of Nutch crawl data directly into Elasticsearch. This is similar in nature to the SolrIndexer that comes with Nutch, which lets you index directly into Solr, and provides a way to index data coming directly from Nutch into Elasticsearch. - CM
Text
Enabling reasoning from unstructured and structured data
Before we really get into this whitepost, it is important to ground the discussion with a set of definitions that will be used throughout.
Unstructured data = Information that does not have a predefined model; it is typically text-heavy but may contain dates, numbers, and facts.
Structured data = Information associated with a data model that determines a predefined structure, typically found in database systems.
Semi-structured data = A form of structured data that does not conform to the formal structure of the tables and data models typically associated with relational databases.
(Definitions adapted from Wikipedia.)
As we know, information is growing at an enormous rate with no real end in sight. An IDC Digital Universe report, underwritten by EMC, estimates that the Digital Universe (i.e., every electronically stored piece of information) will reach 1.2 million petabytes, or 1.2 zettabytes, this year. To imagine this, John Gantz and David Reinsel, authors of the IDC report, suggest you "picture a stack of DVDs, reaching from the earth to the moon and back" (about 240,000 miles each way, or like driving across the United States 80 times).
The majority of this growth is occurring in the evolution of the major forms of media and in the expansion of social media collaboration, both forms of unstructured data. So critical questions arise: how do we come to grips with this reality, and how do we gain insight from the information?
O'Reilly is moving forward with its Strata conferences, which focus on Big Data in our society and provide a vehicle for Data Science to begin to explore some of these questions. It is critical for both business and government to look at this opportunity to gain impressive insights into their operations and a competitive advantage for those operations.
So what issues arise from the growth of unstructured information, and what impediments does it create for the analysis process?
This expansion has prompted quite a few emerging technologies that have grown up out of necessity to handle the volumes of information and provide ways to look at it in new and unique ways. It is truly an exciting time and will continue to be for quite a while. This has also been prompted by a changing problem: the big data problem is being created by us, with user interactions generating trillions of transactions and, in turn, the 1.2 zettabytes of data we will eclipse this year. We have discussed NoSQL before on this blog, and this is a good time to bring it up again as one of the enabling principles for the emerging technologies aiming to give us more business intelligence than we ever dreamed of.
So how do we start to handle the legacy side of all this, beyond the rapidly growing unstructured data? The datacenters holding all the line-of-business data in fileshares, data warehouses, and relational database environments are in many ways the brain of a business. They dictate how it operates, how it makes decisions, and in effect how it survives on a daily basis, and they provide a wealth of historical knowledge. The growth of unstructured information, which is equally important to line-of-business operations, has created a chasm with structured information on one side and unstructured on the other. What questions could you attempt to answer if you were able to search and retrieve across both using a unified generic data model?
This is one of the reasons for the emerging technologies that focus on enabling semi-structured content storage and retrieval. In effect: structured + unstructured = semi-structured.
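To make the "structured + unstructured = semi-structured" idea concrete, here is a minimal sketch of such a generic document model (the field names are illustrative, not the actual Infinit.e schema): every source, regardless of origin, is reduced to a document carrying free text plus lists of extracted entities and events.

```javascript
// Illustrative generic document shape: both a database row and a news
// article can be normalized into this one structure.
function makeDocument(title, text) {
  return {
    title: title,
    fullText: text,   // the unstructured side
    entities: [],     // { name, type } pairs, from extraction or column mapping
    events: []        // { subject, verb, object } style triples
  };
}

// Structured record -> entities come straight from columns, no NLP needed.
var fromDb = makeDocument("WITS incident 1234", "");
fromDb.entities.push({ name: "Abuja", type: "Location" });

// Unstructured article -> same shape, entities filled in by an extractor.
var fromNews = makeDocument("Attack reported", "Militants struck near Abuja...");
fromNews.entities.push({ name: "Abuja", type: "Location" });
```

Because both records now share one model, a single query (say, Location = "Abuja") spans both sides of the chasm.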
To examine this further, we took our Infinit.e platform and set up a simple use case around Nigerian terrorism events that occurred in 2010. We utilized publicly available information from the Worldwide Incidents Tracking System (WITS) from the National Counterterrorism Center (NCTC), along with geopolitical news collected from the internet during the 2010 time period. The premise of the exercise was to investigate various militant groups within Nigeria, looking specifically for significant patterns, events, and linkages. The purpose was to illustrate the fusion of the information temporally, geo-spatially, and contextually, using a generic semi-structured data model over the data sources and their corresponding aggregations.
The first step was to ingest the necessary unstructured data to create a corpus of geopolitically relevant information about Nigeria. An example article is below.
To do this we pulled several related Google News feeds and used our entity extraction framework to fuse two commercially available natural language processing technologies, OpenCalais and AlchemyAPI, to extract entities and events from the unstructured text. The example below illustrates some of the entities and events extracted from one of the documents.
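Conceptually, fusing two extractors looks something like this (a deliberately simplified sketch; the real OpenCalais and AlchemyAPI responses are far richer): each service returns its own entity list, and the results are de-duplicated on a name/type key.

```javascript
// Merge entity lists from two extractors, de-duplicating on name + type.
// The name comparison is case-insensitive so "Boko Haram" and
// "boko haram" collapse into one entity.
function mergeEntities(listA, listB) {
  var seen = {};
  var merged = [];
  listA.concat(listB).forEach(function (e) {
    var key = e.name.toLowerCase() + "|" + e.type;
    if (!seen[key]) {
      seen[key] = true;
      merged.push(e);
    }
  });
  return merged;
}

var calais = [{ name: "Boko Haram", type: "Organization" }];
var alchemy = [
  { name: "boko haram", type: "Organization" },
  { name: "Nigeria", type: "Country" }
];
console.log(mergeEntities(calais, alchemy).length); // 2: duplicate collapsed
```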
For more information, check out the data model contained in our API documentation. This creates a generic representation of the document based on the knowledge the natural language processing can obtain, effectively producing generic metadata for the unstructured content.
The next step was to ingest the structured content into the data model. For our purposes, this illustration used an XML representation of the WITS data and the IKANOW structured analysis handler to map the information into the generic data model. This is done by building the metadata contained in the structured data into corresponding entities and events that allow more meaning to be derived from the information. The example below illustrates the raw information contained in WITS.
To build up the mappings, the structured analysis handler provides for source ingestion and the ability to write scriptlets that build the corresponding entities and events in the model. One writes a small amount of JavaScript against the available metadata to construct the necessary data structures. This supports everything from very simple text processing up through extremely complex JavaScript-driven processing. The example below illustrates how we construct these source mappings and how scripting is used. For further information, visit our API documentation, which covers the structured analysis handler.
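As a sketch of what such a mapping scriptlet might look like (the record field names below are invented for illustration; the actual configuration is covered in the structured analysis handler documentation), a small JavaScript function takes a raw WITS-like record and emits entity and event structures:

```javascript
// Hypothetical mapping scriptlet: raw structured record in,
// entity/event structures out.
function mapRecord(record) {
  var entities = [];
  if (record.city) {
    entities.push({ name: record.city, type: "Location" });
  }
  if (record.perpetrator) {
    entities.push({ name: record.perpetrator, type: "Organization" });
  }
  var events = [];
  if (record.incidentType && record.perpetrator) {
    events.push({
      subject: record.perpetrator,
      verb: record.incidentType,
      time: record.date
    });
  }
  return { entities: entities, events: events };
}

var mapped = mapRecord({
  city: "Jos",
  perpetrator: "Boko Haram",
  incidentType: "Armed Attack",
  date: "2010-12-24"
});
```

The same pattern scales from trivial field-to-entity copying to arbitrarily complex conditional logic, which is the point of using JavaScript for the mapping layer.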
Now we are ready to ingest and examine some data. The ingestion process occurs in real time and is continuously updated as new information becomes available from the defined sources. For the purposes of this whitepost we are using a defined set of information from a defined time period. The example below illustrates a harvested record from the unstructured data source.
In the example above, entities and events were built up from the available structured metadata to create more complex structures. This allows us to bridge the information with unstructured news reporting in order to look for trends or events in the data. The examples below illustrate some of the types of visualizations generated in this exercise.
To create these visualizations we started with a topic or premise for our research; for this variation we chose to look at Nigerian terrorism events occurring during 2010.
We looked at two different organizations operating in Nigeria: Boko Haram and the Movement for the Emancipation of the Niger Delta (MEND). For each we looked for an increase in events or trends during the time period, changes in the types or motives of the events, and the ability to link any of these relationships. This was in an effort to examine the potential impacts of the activity on domestic organizations, government, and foreign policy, and on the relationships among them.
The screen captures above illustrate some of the visualizations created from the use case. We were able to create specific relationships between events and facts that revealed specific patterns in the information tied to both organizations. In addition, we were able to quickly aggregate the events by type over time to show increases in specific activity. This was done by merging very unstructured open source reporting with very structured database reporting, providing a way to unify the information contextually, geo-spatially, and temporally, which in turn enabled reasoning from the information.
-- CM
Text
Monitoring crime data for patterns and trends
Recently we decided to take a look at performing meaningful analysis on structured law enforcement data (crime reports) and fusing it with unstructured data produced through news and social network media.
Critical to finding actionable intelligence is determining the appropriate data necessary to realize the use case. First, we started by looking at publicly available data from the DC Data Catalog. Specifically, we ingested the available Crime Incident Reports and several of the available geo-spatial layers to illustrate how the Infinit.e Structured Analysis toolset can handle the structured report data. Secondly, we then fused this with unstructured data from various social media, blogs and news sources and some synthetic data to simulate intelligence reporting activities. This provided a foundation to stress the unstructured and structured data harvesting capabilities.
The Problem
When performing analysis you need to be able to look at many different data types in many different ways, including both unstructured and structured information. Adding to the challenge, the necessary information is produced sporadically, so there must be a way to ensure you have access to the latest information possible.
Often with law enforcement data, the information is contained in various silos, each holding a single type of information pertinent to the problem, joined with governance issues surrounding access. For example, case records are in one system and tip records in another, with policies governing how each can be accessed and, many times, different "owners". This causes timeliness, productivity, and exposure issues when trying to find the information you are looking for, much less when examining it for patterns and trends.
Additionally, it is necessary to query the data on different types of terms: contextual, geo-spatial, and temporal. Without these query lenses, you cannot access the richness contained in the information. Once the data is accessed, there is a need to visualize it in a way that tells the story based on the question(s) asked. Most humans are visual learners and retain nearly double the amount of information when it is presented visually rather than as text.
The Challenge
Since the data is spread out across multiple environments and mediums, there is a need to collapse multiple systems (e.g. case management, emergency management alerting, department of corrections databases, records management systems, social media and news collection) into a unified analytics and visualization platform.

Since the various data sources and types differ in structure, you must look for a way to make the information more "semi-structured". This provides the mechanism to begin the fusion process. The challenge is that all the various data environments must be accessed to provide a holistic view, so it is necessary to navigate the disparate environments and all the policies governing access and overall information security. All of these nuances must be understood before you can begin the data integration process.
The Solution
The focus of this use case was robbery and theft, more specifically muggings (street mugging, robbery, purse snatch, bag snatch, etc.). We investigated these types of crimes in Washington DC over a three-year period. The data was pulled from Metropolitan Police Department crime reports, various Washington DC blogs, and social media applications such as Twitter, as well as generated tip reports and demographic data to apply additional context.
The video below illustrates the use case and how we investigated the information for patterns and trends. The analysis starts out very broad and begins to narrow based on location and time until the area of interest is obtained.
During the video we looked at several different types of information and viewed it statistically, geo-spatially, and temporally to expose different cuts on the information and examine it for abnormalities or items of interest. Arguably, analyzing the data in all three visualization venues is required to obtain full situational awareness.
The "So What"
On the technical side is the ever-changing data environment. Unstructured data is growing at an exponential rate. Structured data is continually updated with new police reports. Simply, a plethora of new information is being continuously produced at a very rapid pace from many different mediums, both unstructured and structured. An open, scalable analytic platform must be able to handle the data requirements.
On the consumer side, analysts must be able to become more predictive in determining what action to take. This is where unstructured social media sources come in. Fusing social media data with other unstructured and structured data enables analysts not only to get a more complete picture, or situational awareness, but also to "see" or anticipate events coming, such as robberies. Having the tools and mechanisms to easily visualize the information in meaningful ways allows humans to draw conclusions by cueing them to look for patterns and trends.
This example illustrates how this can be done using the Infinit.e toolset and how it can be applied to different types of information easily and quickly to allow for the discovery of patterns and trends.
--CM
Text
Data Visualization with Protovis, jQuery and Infinit.e API
We have been hard at work expanding our REST API, and recently we had time to create some data visualizations to exercise the APIs using the Protovis library and jQuery. To create these examples we used jQuery to access our API and visualized the information in a couple of different ways using Protovis. With Protovis no longer being supported, we will be porting these examples over to d3.js and reviewing them in a follow-on post.
The purpose of this illustration is to provide a simple, easy-to-follow example of how a third-party environment can consume the Infinit.e API to create robust and complex data visualizations.
Protovis is a free, open-source library that can be used to create visualizations. It uses JavaScript and SVG for web-native visualizations; no plugin required (though you will need a modern web browser).
jQuery is used to provide a simple way to consume the API using Ajax and traverse the JSON information returned.
For the purposes of this illustration we performed a system query on the exact text term "Arab Spring" and the event "Person Travel". From the Infinit.e API, we turned on the majority of the available aggregations, including entities, events, facts, and meta tags. For more documentation on accessing the API, check out our wiki.
The first step was to get the data back from the API using jQuery.ajax(). This allowed us to make the necessary calls to our API and gave us the flexibility to build dynamic controls to manipulate the information. For the purposes of this post we focus only on a simple query.
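As a hedged sketch of that first step (the endpoint URL and parameter names below are illustrative, not the documented Infinit.e API schema), the query payload can be built as a plain object and then posted with jQuery.ajax(); only the payload builder runs here, with the browser call shown as a comment.

```javascript
// Build a query payload combining an exact text term and an event type.
// Field names are an assumption for illustration, not the real API schema.
function buildQuery(term, eventType) {
  return {
    qt: [{ etext: term }, { event: eventType }],
    output: { aggregation: ["entities", "events", "facts", "metatags"] }
  };
}

var query = buildQuery("Arab Spring", "Person Travel");

// In the browser this payload would be posted with jQuery, e.g.:
// $.ajax({
//   url: "/api/knowledge/query",   // hypothetical endpoint
//   type: "POST",
//   contentType: "application/json",
//   data: JSON.stringify(query),
//   success: function (response) { render(response.data); }
// });
```

Keeping the payload construction separate from the Ajax call is what makes it easy to wire dynamic controls to the query later.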
Once the data is returned from the API via jQuery, we had to decide which visualizations to use from the Protovis library, which offers a wide array of examples. Since our data is fairly dense due to the aggregations and events present, we chose a hierarchical representation using a treemap visualization and a network representation using an arc example.
The next step was to write some JavaScript data manipulation code to get the returned API data into a format the Protovis visualizations could understand. This is included in the source download.
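The kind of reshaping involved can be sketched like this (simplified; the real API response is richer): Protovis's treemap layout wants a nested name-to-value object, so the flat entity list from the API is grouped by type with significance as the leaf value.

```javascript
// Group a flat entity list into the nested { type: { name: value } }
// shape that a Protovis treemap consumes (via pv.dom in the real code).
function toTreemapData(entities) {
  var tree = {};
  entities.forEach(function (e) {
    if (!tree[e.type]) {
      tree[e.type] = {};
    }
    tree[e.type][e.name] = e.significance;
  });
  return tree;
}

// Illustrative entities; real ones come back from the API query.
var tree = toTreemapData([
  { name: "Cairo", type: "Place", significance: 42 },
  { name: "Tunis", type: "Place", significance: 17 },
  { name: "Mohamed Bouazizi", type: "Person", significance: 63 }
]);
// tree.Place now holds two leaves, tree.Person one.
```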
Using the treemap, we were able to view total aggregation counts for "Media, Tags, Time, Events and Facts" and significance values for "People and Places". This created a density illustration of the example query, in which you can visualize the importance of a particular value given the query parameters.
The image below illustrates the returned treemap based on the query defined above. You can see that the majority of query responses were "news" related and occurred "in the last quarter". Also included are the significant "people", "places", and "events". We could certainly get more complex with the layout and information contained in the tree visualization, but this provides a good starting point for looking at the hierarchical data in a compact way. It provides similar value to a tag cloud visualization, except that it allows different calculations to be merged.
Here is an example of some of the code used to generate the Protovis treemap.
The next Protovis example applied the same query to the arc visualization. This visualization provides a way to illustrate network-related information, and in our case the query for "Arab Spring" and "Person Travel" events presented a decent way to look at this information.
Similar to the treemap, you can see "time, people, media types, events and places", but here you are able to trace the specific events over time. For example, you can see the relationships between the various actors in the query, and specifically the density around the data relationships. Here is an example code snip used to generate the Protovis arc.
Hopefully this provides a simple illustration of how tools that consume JSON easily can be used to create fairly complex visualizations with the Infinit.e API. By adapting the approach and code from this article, you should have working examples using different Protovis visualizations of data returned from the Infinit.e API in a couple of hours.
References
Working with Data in Protovis
jQuery: The Write Less Do More JavaScript Library
Protovis graphical approach to visualization
d3.js Data Driven Documents
Infinit.e Knowledge Query API
-- CM
Text
Is there a downside to using Open Source Software?
Open Source Software (OSS) is no longer the red-headed stepsister it was just five years ago. Although there have been both successful and not-so-successful OSS products, the concept of OSS continues to gain steam in the marketplace. But is there a downside to using OSS, particularly for mission-critical enterprise applications? First, let's define "downside" as risk relating to software quality and supportability. There's no argument that OSS has a lower initial cost, but what about Total Cost of Ownership (TCO)?
There are wildly successful, market-leading OSS products. What do they all have in common? First, they provided a unique solution to the market at the right time. Secondly, and arguably most importantly, they all have highly active, committed communities. The community is key to success and risk mitigation for the OSS consumer. It is the community, and how it relates to the vendor (sponsor), that determines the robustness and quality of the code, as well as supportability and TCO.
Ironically, the community is also the key to success and risk mitigation for the OSS vendor (sponsor). Neither the consumer nor the vendor wants to rely on a build-it-and-they-will-come approach, at least from a risk mitigation standpoint. In developing a new OSS product, there is a bit of a chicken-and-egg problem: where should the effort be placed first, community or product, and how much effort should go into each area? These are often difficult decisions for small companies, and a place where it may be appropriate to get some OSS mentoring.
Interestingly, what begins as a chicken-and-egg problem quickly turns into a chicken-and-pig problem. For those not familiar with it, it goes like this: the Chicken suggests to the Pig that the two open a restaurant called Ham and Eggs. The Pig notes that laying the eggs requires only a contribution from the chicken, while providing the ham requires total commitment from the pig.

Successful OSS communities require both "chickens" and "pigs" in order to develop robust, successful open source software products. However, given the commitment required, pigs can be difficult to get on board amid competing priorities. Therefore, establishing a successful community means ensuring there are sufficient pigs and that they are empowered to drive the product in return for committing to, and taking accountability for, it.
The real downside of OSS, therefore, is poor community support. Both consumers and vendors must be highly active in contribution and commitment. For the vendor to be successful, it must create and nurture a community environment that rewards commitment to the product. Similarly, the OSS consumer mitigates risk by becoming highly active in the OSS product community, paving the way to leverage the cost and flexibility advantages. Success is a shared responsibility!
-DS
Text
OSINT and Social Media
Open Source Intelligence (OSINT) is an extremely interesting topic with regard to unstructured data analytics. Specifically, how does intelligence from social media fit in the context of OSINT? Government agencies have been slow to embrace social media as a significant source of intelligence; in fact, there seems to be a clear delineation between OSINT and social media information. Perhaps this is one of the reasons for the slow adoption and utilization of social media. But why is public domain social media not considered OSINT?
One must acknowledge there are a few impediments to analyzing open social media. First, the amount of data available is enormous, hence the term "Twitter fire hose". Given this reality, technology must be able to account for the exponentially growing amount of data and present it visually in a way that adds value for the analyst. A second potential reason social media is treated differently than OSINT may be privacy concerns, particularly here in the US: Personally Identifiable Information (PII) must be protected. DHS has been fairly transparent in its approach to protecting PII, although one could argue that public domain information is open to anyone, regardless of whether the source is a news reporter or Twitter; the end result is the same.
Finally, and perhaps most importantly, there is the perception that social media is unreliable due to examples of deliberate misinformation. The existence of misinformation is often seen as a reason not to consider social media as credible intelligence. However, one must ask whether misinformation is itself valuable in analyzing trends and patterns. Here at IKANOW, the answer is an emphatic YES! Remember, unstructured research-driven analytics does not provide an answer, per se, but presents the analyst with the necessary cues and information to make a decision and act.
The Arab Spring is an example of "missing it" with social media. According to a recent conference speaker, a highly experienced analyst never saw the Libya uprising coming. This is profound! Would fusing social media, such as Twitter, with traditional OSINT data have helped?
Getting back to the original question of how social media fits in the context of OSINT: social media should be considered a type of OSINT, and then addressed appropriately and ethically through technology, process, training, etc. to ensure PII is protected. By solving the social media intel problem, a level of predictability can be realized by the analyst in deciding when and how to take action. It is the fusion of this real-time information with other unstructured and structured data that creates a more complete situational awareness, allowing analysts to anticipate future events.
-DS
Text
Performing semi-structured and structured data analysis
Although IKANOW is squarely focused on unstructured data discovery (by helping make it semi-structured) and analysis, we understand that there is a need to fuse that information with more structured sources. In researching and developing this capability, our engineering team has created a framework and set of tools specifically designed to handle structured information, allowing users to connect to and make sense of databases in much the same way we handle unstructured data. This might be the "so what" moment.
Infinit.e excels at handling the ingestion and enrichment of unstructured data (RSS, ATOM, news feeds, blog postings, documents, etc.), and this capability extends that to structured data found in relational database systems such as SQL Server, Oracle, MySQL, and PostgreSQL; delimited files such as CSV and tab-delimited; and XML documents. This provides organizations with a holistic data analysis platform that can easily harvest almost any type of information.
For unstructured data, the Infinit.e toolset provides a framework designed for quick integration with popular text extraction tools and natural language processing technology, whether cloud-based, open source, commercial off-the-shelf, or government off-the-shelf, to find entities, events, facts, summaries, quotations, and sentiment buried within text. Structured data presents a slightly different problem, since these relationships are already present in the data.
The engineers at IKANOW have developed a simple to configure data extraction interface and framework that allows you to:
Connect and harvest information from any JDBC compatible RDBMS system
Harvest file sources from a filesystem
Parse XML documents and fuse with other structured and unstructured data sources
Map data elements in a structured source to corresponding entities and events
Specify advanced internal data processing functionality using JavaScript
This functionality provides an extremely powerful framework, engine, and fusion capability for merging a wide variety of structured and unstructured data sources into a single unified analysis environment. - C.V.
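As a rough illustration of what such a source definition might contain (the key names below are invented for this sketch, not the actual Infinit.e source schema), a JDBC harvest plus a mapping scriptlet could be expressed as one configuration object:

```javascript
// Hypothetical source configuration: a JDBC harvest with an inline
// JavaScript mapping from columns to entities. Connection details
// and key names are illustrative only.
var source = {
  title: "Incident reports",
  type: "database",
  database: {
    driver: "com.mysql.jdbc.Driver",          // any JDBC-compatible driver
    url: "jdbc:mysql://dbhost/incidents",     // illustrative connection string
    query: "SELECT city, offense, report_date FROM reports"
  },
  mapping: {
    // Scriptlet run per row; column names match the query above.
    entityScript: "entities.push({ name: row.city, type: 'Location' });"
  }
};
```

The same pattern, with a file path or XML XPath in place of the JDBC block, would cover the filesystem and XML cases in the list above.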
Text
DI2E Conference thoughts...
The DI2E conference was held in Dallas, TX this week. It is quite clear the intelligence status quo is no longer working. At one time, good intel data was hard to come by and the risk was easy to manage via stovepiped environments. Today, there is an overwhelming amount of information available to analysts. But how does the analyst find good intel (the needle in the needle stack) in the explosion of information now available to support the mission? How can analysis decision time be improved in order to take action at the edge sooner? Now add in the reality of significantly reduced budgets for the foreseeable future.
Risk profiles must also change to enable coalition partners to work effectively towards a common goal, not to mention sharing intel within our own cross-agency teams. As noted by one speaker, the risk profile must move from need-to-know to need-to-share. There is more risk, but the benefit trade-offs are worth it. This transformation will not be easy, as many years of ingrained cultural impediments must be overcome. However, there seem to be some forcing functions that may help drive new behavior, such as budget constraints leading to reduced resources in both people and systems.
Some key themes at the conference were more effective communication and collaboration, the integration of humans and technology, information management, and the dissemination of information to commanders. Moving forward, systems and practices must be adapted to the specific mission, and they must be flexible... agile? Boilerplate approaches are not sufficient. Unfortunately, there is no silver bullet.
The message to the solution provider community was also clear: we must do a better job of understanding the customer's needs. However, many times the customer doesn't know what to ask for regarding intel (ISR) capabilities. This is symptomatic of solving problems where a lot of unknowns exist. In many cases, analysts are not sufficiently trained on existing systems, or on the "research driven" practices required for unstructured OSINT analytics, which contributes to the requirements gap. This is not a dig at the intel community, but a reality of the speed and amount of information. The game has changed, and solution provider systems and processes must adapt as well to enable the highest level of mission success for our commanders.
At IKANOW, our focus is on solving the large unstructured (and structured) data problem with Infinit.e, an open analytics platform that is easily configurable to meet specific mission needs. Additionally, IKANOW's Agile Intelligence approach addresses many of the analyst enablement impediments noted above by incorporating agile Scrum practices, such as a continuous inspect-and-adapt approach and the integration of technical platform SMEs with analysts in cross-functional teams. The open analytic Infinit.e system, combined with the agile intelligence approach, gives the analyst team the flexibility, adaptability, and enablement required to solve the intel problems of today and tomorrow.
-DS
#DI2E#ISR#OSINT#agile#analytics#budget#intel#intelligence#open analytics#research driven#unstructured data#Infinit.e
Link
Please take a moment to support us by filling out this questionnaire. We are in the process of making some important decisions this summer and need guidance from our user community on how best to serve you.
Your answers will go directly toward shaping our overall solution and will play a vital role in shaping an open analytics community.
Text
Performing Market Research with the Infinit.e Beta
Recently we began to look at how Infinit.e could be used to perform competitive intelligence research (market research) as well as brand monitoring. Businesses today need tools that enable quick insight into the competitive market landscape. Information about competitors and products plays a big role in how effective your business is and can become.
The web has changed the way companies conduct themselves, as well as the way individuals do. Social media has enabled people to voice themselves in a way that was not previously available, and with that comes a powerful tool that can impact your business, product, or brand in both positive and negative ways.
Infinit.e enables businesses and individuals to perform competitive intelligence across markets and learn from the trends and patterns shown by historical activity. Businesses can gain insight into, and learn from, the competition by using the media and information produced on the internet daily, and can gain competitive advantages through patterns and trends pertaining to products and services offered in the market. Social media takes many forms and holds a wealth of knowledge, in blogs, forums, and social networks, in which to look for events, facts, and summaries along with the sentiment of the language.
The short video below illustrates an example of using Infinit.e to research news pertaining to tablet computers and, more specifically, buzz pertaining to the operating systems in the discussed media and by product vendors. We do this by looking at some specific statistics (significance) to identify patterns and temporal aggregations to visualize what is happening in the technology landscape.
Sentiment adds an interesting dimension to competitive intelligence research. We are currently hard at work integrating sentiment data into our data model; this will offer a new look at how organizations can manage their brand and fuse that information with other media sources to get an analytic picture of their competitive landscape. This capability, coming later this summer, will allow organizations to follow who is talking about their business or products and identify the sentiment of the discussion over time. You will be able to view analytics for monitoring the buzz generated from a product launch, or monitor your brand against social network data.
We are super excited about using this technology in the market research vertical and the value it can provide. We also plan to put together additional use cases against specific events and products to illustrate further functionality and uses.
If you are interested in topics like the above and would like to help us shape the growth of an open analytical environment, please send us a note or join our user forums.
-CM
#market research#infinit.e#beta#analytics#branding#corporate monitoring#whitepost#patterns#trends#technology