Hi,
I just moved my posts from Posterous! Do go through my blog for all the new posts.
It's easy to migrate; try JustMigrate.
3Crumbs app - Are you the local thrifter we've all been looking for?
What's in a tweet?
Well, the title was supposed to mimic "What's in a name?"
So what's in a tweet? As you would expect, quite a lot. This post discusses, in layman's terms, some of the features we extract from a tweet and the algorithms used in the process.
Let's make a list of features:
1. Named Entities.
2. Words in the hashtags.
3. Words in the URL.
4. Script.
5. Language.
6. Profanity.
7. Stemming.
8. Sentiment.
9. Commercial Intent.
10. Features for text classification and other ML problems.
1. Named Entity Recognition:
NER involves identifying proper names in text: people, organizations, locations, dates, times, monetary expressions, etc. Let's set up an example that will be used throughout this post.
ex: Some residents of Koramangala met PM Singh in Delhi. They expressed satisfaction, but vowed to continue the protest. #LokpalBill http://t.co/xHRbW
where the URL may expand to http://ndtv.com/politics/.../koramangala-pm-singh-delhi-lokpal-bill.htm
Here, PM Singh, Delhi, Lokpal Bill, and Koramangala (or "residents of Koramangala") are named entities of interest. Though "Some" and "They" are capitalized, they are not that interesting.
There are two broad categories of algorithms used for NER.
1. Rule-based models, also called linguistic-grammar-based models.
2. Statistical models such as Hidden Markov Models, Maximum Entropy, and Conditional Random Fields.
In a rule-based approach, cues like capitalization and stopword lists can help us identify the entities, as in the sketch below.
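Here is a minimal sketch of those two cues; the module, function name, and stopword list are made up for this post, not our production NER.

-module(ner_sketch).
-export([candidate_entities/1]).

%% Illustrative stopword list; a real one would be much longer.
-define(STOPWORDS, [<<"Some">>, <<"They">>, <<"The">>, <<"A">>, <<"In">>]).

%% Take capitalized, non-stopword tokens as candidate entities.
candidate_entities(Text) ->
    Tokens = re:split(Text, "[^a-zA-Z]+", [{return, binary}, trim]),
    [T || T = <<C, _/binary>> <- Tokens,
          C >= $A, C =< $Z,
          not lists:member(T, ?STOPWORDS)].

On the example tweet this yields Koramangala, PM, Singh, Delhi, and LokpalBill; a further pass would merge adjacent candidates like "PM Singh" into one entity.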
2. URL analysis:
While keeping track of URLs might help with clustering, ranking, etc., the keywords present in the URLs themselves are very interesting. Sometimes the tweet itself might not have entities ("Good Read. http:...") but the URLs may have words that are useful to us. If we can recognize a hierarchy in the URL ("/politics/"), such words can be used for classification, among other things. A sketch follows.
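One rough way to mine those keywords, assuming the short URL has already been expanded (names here are illustrative):

-module(url_keywords_sketch).
-export([url_keywords/1]).

%% Split a URL on path and word separators and keep alphabetic fragments.
%% A stopword list for parts like "http", "www" or "com" is still needed.
url_keywords(Url) ->
    Parts = re:split(Url, "[/\\-_.?=&:]+", [{return, binary}, trim]),
    [P || P <- Parts, byte_size(P) > 2, re:run(P, "^[a-zA-Z]+$") =/= nomatch].

The expanded URL from the example yields politics, koramangala, singh, delhi, lokpal, and bill, plus noise like http and htm that the stopword list would drop.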
3. Hashtag analysis:
Hashtags, for the most part, convey interesting information about the tweet. They may identify or reinforce the subject being discussed, point to the class or a related keyword, give hints about the sentiment, etc. Extracting the words from a hashtag takes some care, since the tag may be capitalized, camel-cased, or several words simply run together (see the sketch after the example).
ex: "Lokpal Bill" from #LokpalBill
4. Script Detection:
While the charset of most social media updates tends to be UTF-8, it's not a bad idea to verify it ourselves. Once done, it's a question of going over the code points and comparing them with known ranges for various scripts. However, it's advisable to optimize this process using appropriate data structures and counting schemes. We should also keep in mind the presence of characters from other languages (especially English) in handles, URLs, hashtags, etc.
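In its simplest form the counting looks like this. A sketch with a tiny sample of ranges, assuming the input has already been validated as UTF-8; a real implementation would cover many more scripts and use a faster lookup:

-module(script_sketch).
-export([detect_script/1]).

%% Classify each code point against a few known ranges.
script_of(CP) when CP >= $A, CP =< $z -> latin;   %% rough ASCII-letter range
script_of(CP) when CP >= 16#0900, CP =< 16#097F -> devanagari;
script_of(CP) when CP >= 16#0B80, CP =< 16#0BFF -> tamil;
script_of(_) -> other.

%% Count code points per script and pick the most frequent one.
detect_script(Utf8) ->
    CodePoints = unicode:characters_to_list(Utf8, utf8),
    Counts = lists:foldl(
               fun(CP, Acc) ->
                       maps:update_with(script_of(CP), fun(N) -> N + 1 end, 1, Acc)
               end, #{}, CodePoints),
    {Script, _Count} = lists:last(lists:keysort(2, maps:to_list(Counts))),
    Script.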
5. Language Classification:
While Twitter itself identifies a number of languages, it's not a bad idea to implement a language classifier ourselves. A Naive Bayes classifier seems more than sufficient to identify languages: find the n-grams of the words in a tweet and use the classifier on them (feature extraction sketched below). A number of optimizations are possible both while selecting n-grams and while implementing the classifier. Identifying the language of the tweet is especially important if the tweet has vernacular words transliterated into English; such words might throw off rule-based NER, for example.
ex: while the example tweet is unlikely to confuse the classifier, watch out for "Singh" and "Lokpal" in other circumstances.
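The feature extraction side is simple enough to sketch; character trigrams are one common choice of n-gram. The classifier itself (counts and smoothing) is omitted:

-module(ngram_sketch).
-export([tweet_ngrams/1]).

%% Character n-grams over one word; words shorter than N pass through.
char_ngrams(Word, N) when length(Word) >= N ->
    [lists:sublist(Word, I, N) || I <- lists:seq(1, length(Word) - N + 1)];
char_ngrams(Word, _N) ->
    [Word].

%% Lowercase, split on spaces, and emit trigrams per word.
tweet_ngrams(Text) ->
    Lower = string:lowercase(unicode:characters_to_list(Text)),
    Words = string:lexemes(Lower, " "),
    lists:append([char_ngrams(W, 3) || W <- Words]).

For instance, tweet_ngrams("met PM Singh in Delhi") gives ["met", "pm", "sin", "ing", "ngh", "in", "del", "elh", "lhi"].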
6. Profanity Detection:
While a keyword-list or taxonomy-based profanity lookup seems reasonable enough, we need to ensure that deliberate or accidental misspellings of profane words don't escape our detector. A brute-force lookup of such misspellings and edit-distance-based detection are both promising; a sketch of the edit-distance route follows.
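Here is a plain, unoptimized Levenshtein distance over charlists; a word within distance 1 of a blacklist entry would be flagged. A sketch only; real detectors memoize or use automata:

-module(profanity_sketch).
-export([levenshtein/2]).

%% Recursive Levenshtein distance; exponential in the worst case but fine
%% for short words.
levenshtein([], T) -> length(T);
levenshtein(S, []) -> length(S);
levenshtein([C | S], [C | T]) ->
    levenshtein(S, T);
levenshtein([_ | S] = S0, [_ | T] = T0) ->
    1 + lists:min([levenshtein(S, T0),    %% delete from S
                   levenshtein(S0, T),    %% insert into S
                   levenshtein(S, T)]).   %% substitute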
7. Stemming:
For a number of reasons, finding the stems of the words in a tweet is useful. At the very least, this helps in maintaining shorter dictionaries, but stemming may become necessary if we have to find related keywords, index the tweet for search, etc. An implementation of the Porter stemmer is made available by Dr. Porter himself. A combination of stemming programs is typically used, and maintaining a good stemming dictionary is also important.
8. Sentiment Analysis:
A number of supervised learning methods can be used to classify tweets as positive/negative or into more detailed categories. A keyword-based lookup can also be implemented, but the results are likely to be coarse. Besides traditional features like keywords and exclamation marks, tweets typically use emoticons and "loooong" words to express sentiment (normalization sketched below).
ex: "They expressed satisfaction"
9. Intent Detection:
Tweets where users pose a question or express an intent, especially commercial intent, are useful for recommendation systems. As with sentiment analysis, a number of supervised learning methods can be used to determine whether a tweet expresses intent.
ex: "but vowed to continue the protest."
10. Features for text classification, clustering:
Training text classifiers on real-time tweets is a very interesting way to keep the corpora up to date. A Naive Bayes classifier might be sufficient if the number of classes is not very large. More interesting hierarchies can be built by finding related words and using the words in hashtags, URLs, etc. Such hierarchies are useful in building taxonomies.
ex: PM Singh, Lokpal Bill, politics.
Note: A large number of insights can be obtained by keeping track of retweet counts, replies, follower/list counts, connectedness, conversations, etc. Those make up the 'dynamic score' of a tweet, while the in-tweet features above help find its 'static score'.
Tags: JustMigrated, commercial intent, inagist blog, language classification, machine learning, naive bayes classifier, nlp, porter stemmer, script detection, sentiment analysis, classification, iyottasoft, text analysis, twitter
A Lightweight Work Distribution Queue with ZeroMQ
Working with the Twitter Streaming API, one of the challenges has been to efficiently even out the spikes in tweet volume. We have a fixed number of resources crunching the tweets coming off the stream. Breaking news happens, tweets spike, and our backends go crazy. We consume tweets off the limited public firehose and the site stream endpoints of the Twitter API. The public firehose comes off a single stream, but the site stream API has a limitation of 100 users per connection, so we keep multiple connections to the site stream API to listen to all the user streams. All the public statuses coming off these streams are combined and duplicates dropped; the remaining ones are processed into the system.

We spawn off the processes handling the streams on one of the backend nodes, and each tweet coming off them spawns a separate process to do the processing. Under normal load this works fine: these processes are short-lived and clear off, leaving the system in a healthy state. When spikes occur, spawned processes spike, messages pile up in the stream handlers, and a node sometimes nukes itself by running out of memory. So we suppressed process spawns whenever the process count on a node was too high. This meant lost tweets, but a few missed tweets were OK in our scheme of things. That's when the message queues of the streamers started going bust: too many messages coming in from the stream, and we were not clearing the queue fast enough. Process message queues can shoot you in the foot if they are not managed properly; if not cleared fast enough they will use up all your memory and bring you down real quick. We did not want to introduce an external message queue application into the mix, and wanted a simple solution to the problem at hand.
ZeroMQ looked ideal: it sits in-process as a library, has nice NIF-based Erlang bindings (erlzmq2), and, of special interest, has a swap option which starts buffering messages onto disk once the in-memory limits are exhausted. We played around with a few configurations, and this is how we are now set up. Each of the streaming processes connected to Twitter pushes messages from the HTTP stream into a PUSH socket. We coded up a ZMQ device consisting of a PULL and XREP/ROUTER socket combination. Each of the consumers is a REQ socket. The PULL-XREP combination allows us to balance the load across the REQ sockets, and each consumer process, depending on the load on its node, can decide the rate at which it pulls messages from the sink. The PUSH socket is SWAP-enabled, so it swaps excess messages onto disk as needed.

The ZMQ device is implemented as a simple gen_fsm, with the PULL socket in passive mode and the XREP in active mode. A message from one of the REQ sockets triggers the cycle, causing a blocking read on the PULL socket. The nice thing about the XREP socket is that it can route messages back to the original requestor: the moment the device gets a message off the PULL socket, it sends it to that REQ socket and waits for the next request. Overall it works like a charm for our current load, which is about 10M tweets a day.
Here is how the PULL-XREP device looks:
https://gist.github.com/1414361
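For readers who don't want to click through, a stripped-down version of such a device might look like the following. This is a hedged sketch with erlzmq2, not the gist's code: the endpoints are made up, and where our device is a gen_fsm with the XREP socket in active mode, this loop reads both sockets passively for brevity.

-module(tweet_zmq_device_sketch).
-export([start/0]).

start() ->
    {ok, Ctx} = erlzmq:context(),
    %% streamers connect PUSH sockets here; SWAP/disk buffering is
    %% configured on their side, as described above
    {ok, Pull} = erlzmq:socket(Ctx, pull),
    ok = erlzmq:bind(Pull, "tcp://*:5555"),
    %% consumers connect REQ sockets here
    {ok, Router} = erlzmq:socket(Ctx, xrep),
    ok = erlzmq:bind(Router, "tcp://*:5556"),
    loop(Pull, Router).

loop(Pull, Router) ->
    %% wait for a consumer request: identity frame, empty delimiter, body
    {ok, Id} = erlzmq:recv(Router),
    {ok, <<>>} = erlzmq:recv(Router),
    {ok, _Req} = erlzmq:recv(Router),
    %% blocking read of the next tweet off the PULL socket
    {ok, Tweet} = erlzmq:recv(Pull),
    %% route the tweet back to the requesting REQ socket
    ok = erlzmq:send(Router, Id, [sndmore]),
    ok = erlzmq:send(Router, <<>>, [sndmore]),
    ok = erlzmq:send(Router, Tweet),
    loop(Pull, Router).

The identity and empty-delimiter frames are what let the XREP socket route each tweet back to the exact REQ socket that asked for it.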
Why Erlang?
I often get weird looks whenever I mention that inagist is written in Erlang. So here are some of the key areas where Erlang is a winner for us.
What we do
At inagist we try to summarize a real-time stream in real time. Currently we work on top of the Twitter Stream API. By summarizing I mean filtering, in real time, the popular tweets in a stream and grouping tweets based on trends. See it in action at justinbieber.inagist.com or libya.inagist.com (try Chrome or Safari for best results, since it uses WebSockets). We do this summary on a stream which can be combined in any number of ways: a user's own stream (my stream), a keyword-based search stream (libya.inagist.com), or a keyword + geo-location based tweet stream (sxsw.inagist.com).
Lightweight Processes
The key differentiator here is the real-time nature of how we summarize the tweet stream. Instead of persisting each tweet in the stream and running offline analytics, we maintain limited caches for each user and keep popping tweets in and out as they gain popularity, or as keywords repeat in the stream for trend detection. Here is where a key aspect of Erlang fits in. Each stream consumer is modelled as an Erlang process: lightweight and isolated. It is essentially a proxy for each provisioned user; it receives tweets from the stream, manipulates the caches, and responds to API queries for serving data. Each of these stream consumers is a gen_server implementation, tied into a supervisor chain. In case one of the consumers goes down, the supervisor brings it back up with no impact on the rest of the user base. A skeletal sketch follows.
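In skeletal form, such a consumer might look like this; a sketch where the cache logic and all names are illustrative, not our actual code:

-module(stream_consumer).
-behaviour(gen_server).
-export([start_link/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

start_link(UserId) ->
    gen_server:start_link(?MODULE, UserId, []).

init(UserId) ->
    {ok, #{user => UserId, cache => []}}.

%% Tweets arrive as plain asynchronous messages from the stream fan-out.
handle_info({tweet, Tweet}, State = #{cache := Cache}) ->
    {noreply, State#{cache := update_cache(Tweet, Cache)}};
handle_info(_Other, State) ->
    {noreply, State}.

%% API queries are served from the in-memory cache.
handle_call(popular_tweets, _From, State = #{cache := Cache}) ->
    {reply, Cache, State};
handle_call(_Req, _From, State) ->
    {reply, ok, State}.

handle_cast(_Msg, State) -> {noreply, State}.
terminate(_Reason, _State) -> ok.
code_change(_OldVsn, State, _Extra) -> {ok, State}.

%% Bounded cache: keep only the most recent 100 tweets.
update_cache(Tweet, Cache) ->
    [Tweet | lists:sublist(Cache, 99)].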
Messages
So how do we couple this to the stream of incoming data? With messages. Each tweet is delivered to the consumers as a message from the Stream API client. Each tweet consumer is part of a process-group tree spanning machines. The moment a tweet is received from the network, a separate process is spawned to JSON-decode the tweet and send messages into the distribution tree. The message trickles down to the consumer processes, which do their job of cache updates. The messages being asynchronous, the client is not concerned with how many consumers a tweet has; if a consumer process is interested in a tweet, it consumes it. Being decentralized, delivery scales as the load of incoming tweets goes up or the number of consumers increases.
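The entry point is tiny. A sketch, with mochijson2 (a common Erlang JSON decoder of the era) and a plain pid list standing in for the distribution tree:

-module(stream_fanout).
-export([handle_raw_tweet/2]).

%% Spawn a short-lived process per raw tweet: decode, then fan out.
handle_raw_tweet(RawJson, ConsumerPids) ->
    spawn(fun() ->
                  Tweet = mochijson2:decode(RawJson),
                  [Pid ! {tweet, Tweet} || Pid <- ConsumerPids]
          end).

The {tweet, Tweet} messages here are the same ones the consumer skeleton above picks up in handle_info.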
Distributed Erlang
As the number of consumers increases, another key aspect of Erlang comes into play: distribution across machines. Consumer processes by design are known only by a process id, and Erlang works with remote process ids the same way it does with local ones. As long as a consumer is part of the distribution tree, the tweet will be delivered to the process. This helps us scale out easily. Individual machines can fail independently without affecting the cluster as a whole; only users provisioned on a failed machine are offline for the time it is down.
Real Time Delivery
To be as real-time as possible, we prefer to deliver over WebSockets to connected clients. Messages come into play here again. Each stream consumer generates messages as its caches figure out a trending tweet or trend. WebSocket clients tap into this message stream, convert the messages to JSON, and deliver them to the client. Our Chrome browser plugin is one such WebSocket client: the extension notifies via desktop popups whenever a trend is detected or a tweet gains popularity above a certain level. We also bring a different angle to real-time search with the extension. When a trend is detected, the extension automatically starts searching for prominent tweets on the detected trends.
Streaming Search
I have written previously about how we use Riak for search and storage. In addition, we have built some custom machinery that enables streaming search. Whenever we index documents in Riak, we also send the indexing data as messages to the search infrastructure; we also send these index terms on a separate index as soon as a tweet is received. Here we have processes waiting for <index, field, value> tuples to match against, which notify waiting processes when a document matches the search criteria. We currently support the ==, and, or, >, <, >=, =< operators, so we can detect any tweet containing sxsw (==), justin bieber (and), documents containing a text, a lat/long within a bounding box, etc. Stream consumers use this to get real-time filtered tweets from the stream. The Chrome plugin also taps into this search stream to notify a user whenever a tweet matches a detected trend or an explicit search query. This is really powerful, since we automatically figure out a user's interest topics by way of trends and can let the user know whenever something matches such a topic in real time. The whole streaming search runs with such low overhead, thanks to the message-based architecture, that we can stream results all the way to the browser, typically in under a second or two. You can see it at work when you click on the "Live Stream" heading on pages like justinbieber.inagist.com.
I have given just a high-level view of where Erlang acts as a differentiator, but it should give some insight into why we do Erlang all the way. Drop in a comment or ping @jebui and I will be happy to give more information.
PS: I have not heard any of Justin Bieber's or Lady Gaga's albums; they just happen to have very active tweet streams.
Searching with RiakSearch
In my previous post I mentioned what we do with the search functionality at inagist.com. This post looks into the technical details behind the implementation. We use Riak as our storage layer, so RiakSearch was a natural add-on. I will try to detail my understanding of how RiakSearch works and how we use it.
At its heart, RiakSearch is an inverted index from terms to document ids. The inverted index maintains an ordered set of document ids, and the merge_index backend which stores this index splits it across various files. Specifically, the backend has buffers, which maintain the index in ETS as well as in files, and segments, which are files using a custom format to store ordered keys associated with an {index, field name, field value} tuple. Segments store metadata about file offsets for key lookup, and this metadata is loaded into ETS at startup for faster access; additionally, bloom filters speed up lookups within each offset. Buffers are periodically merged into segments, and segments, once created, are not updated except when several segments merge into one. Very much like the Bitcask store for Riak. All this happens at the vnode level; riak_core sits on top of it and distributes the operations across vnodes. Index name, field name, and field value determine the hash that maps to a vnode. At indexing time a document is split into postings, which map {index name, field name, field value} to a document id plus a bunch of properties. These are batched and sent, in parallel, to the vnodes responsible for each hash.
Queries are broken into a set of logical operations which combine each individual matching term and bring up a final list of matching documents, which are then sorted and ranked. A query like "tweet:facebook email" is broken into something like "tweet has facebook and email". This translates to a logical AND of docs having tweet:facebook and tweet:email; these operations are then sent to the vnodes, which stream the doc ids matching each operation. The doc ids are merged in order via a merge sort, since the keys are already sorted. This results in a final list of doc ids matching the query, plus the properties for each doc. The results are sorted based on these properties and finally returned. The properties carry a term frequency for each doc, and a pre-plan operation gives the document frequency for each term, allowing the docs to be ranked by term frequency and inverse document frequency.
That was a whirlwind, simplified wrap-up of my reading of the search code. To note here is that it is a very performance-aware implementation for indexing and simple queries. Queries with low-cardinality terms can blow up your search times, though there is something in the works for this specific issue with inline fields. Queries which can return millions of rows are also potential memory busters: even if you pass options to limit the number of results, the limiting happens only after the full results of the operation are in memory. Both of these reiterate that this is, as the name implies, "RiakSearch": an add-on to make your life easier when working with Riak. The implementation is tailored for mating with Riak's map-reduce operations, and that is readily exposed via all the interfaces to RiakSearch.
How do we use it?
The search box on inagist.com is directly wired into RiakSearch. To prevent our queries from leading to memory exhaustion, we use a couple of tricks. As previously mentioned, our backend is fully in Erlang and we talk to Riak directly in Erlang. We call search_fold on the riak_search_client from our code and break out of the search operation when we have enough results. Our keys, the tweet ids, are stored as negative numbers, so that the sort order of the keys gives us the first n docs latest-first (sketched below). We then rank within this limited set.
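The trick relies on binaries comparing byte-wise. This is roughly the idea, though the exact encoding shown here is illustrative rather than our actual one:

-module(sortkey_sketch).
-export([sort_key/1]).

%% Negate the id and encode as a signed big-endian binary: ascending key
%% order then equals newest-first, so a fold can stop after N results.
sort_key(TweetId) ->
    <<(-TweetId):128/signed>>.

%% true = sort_key(20339861590) < sort_key(20337776197).
%% (the newer, larger id sorts first)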
The next place we use search is for threading conversations on the tweet detail page. In an earlier blog post I mentioned how we did that with links and link-map-reduce operations. With search, we just index each reply against the tweet it is in reply to and bring it back via a search on the in-reply-to field. This is better, since a link update rewrites the whole document unnecessarily when all we needed updated was the metadata on the tweet.
Another place we plug in search is our clean-up operation. We index the tweet timestamps at minute granularity and clean up tweets older than a certain time period. Getting to older tweets without search would have meant maintaining the ids separately to get a handle on which ids to flush out.
While you can have RiakSearch index all documents stored in your Riak cluster via a pre-commit hook, we decided to trigger the indexing via our own calls into RiakSearch. Two reasons for this: we saw the pre-commit hook fail a couple of times with timeouts under heavy load, and our indexing needs meant we index the text of a tweet at the point when the app determines it is indexable, not at the point of insert into the backend store. Parameters like the timestamp, however, are indexable at insert time.
Final Thoughts
RiakSearch perfectly complements the Riak key-value store: it frees you from having to access documents by id alone, and managing your data becomes simpler. The fact that it works well with existing Java code for text analysis is also worth mentioning. It's still in beta, so I guess things are only going to get better from here.
Search - ranked for social relevance
We recently pushed out a search box on inagist.com which looks pretty normal but takes a whole different approach to Twitter search. A normal Twitter search shows a couple of prominent tweets and the whole real-time result stream. We give it two different views: one a search within your follow circle, the other across the broader universe. Additionally, not all tweets are indexed, only the prominent ones. So a search for "facebook email" after I log in to inagist shows me two sections, sorted by activity, allowing me to quickly catch up on something as perceived by the people I'm interested in and by the general public.
We take this a step further and apply it to trends we detect in your stream or the channels you follow. As I write this post, I get an alert from the Chrome plugin that "Prince William" is trending in World News. I click through on the trend and get a gist of the tweets which caused it to trend in World News, along with a bunch of tweets from my follow list on "Prince William".
Hope this enhances your experience on inagist.com. We also use this search to power our Related Tweets widget, which you can see in action on the LiveMint blog.
Behind all of this is Riak Search from the wonderful guys at Basho. I will get more into the technicals of the search and our experiences using Riak Search in the next post. In the meanwhile, do search around and let us know what you like or dislike.
Your news pushed your way as it happens - inagist.com notifications from notify.io
At inagist.com we want to make sure you get your news as it happens. The traditional web interface has its limitations in realizing this experience. While we would like you to stay glued to our site watching new tweets and trends, or to keep watching the stream flow by, we guess you might need to step out once in a while :) So we now integrate with notify.io and push events to you as they happen. You decide how you want to receive them: on your iPhone, desktop, SMS, or IM.
To get started with push, log in via Twitter to inagist.com, go to your settings page, and add your email id or md5 hash to the Jabber/Gmail/Notify.io text box down below. Whenever something of interest comes by, we will send it your way. Make sure you allow events from inagist in your notify.io settings page.
We push tweets which have crossed a certain threshold of activity, trends picked up in your stream, and trends picked up in your favorite channels. All of these are tagged so you can optionally filter them on your client. Tweets are tagged "tweet", personal trends "trend", and channel trends "channel_trend" with an additional tag of the channel name. So trends from worldnews.inagist.com are tagged "channel_trend worldnewsgist".
In the rare chance of notify.io being down, we have our own XMPP bot: befriend [email protected] from the Jabber id you configured above (an md5 hash will not do) and you will get XMPP messages from the inagist bot directly.
Try it out and let us know how we can improve things.
Link-Map-Reduce in Riak: an example from inagist.com
My last post felt a little incomplete without some code backing it up, so I'm following it up with a sample of how exactly this map-reduce is wired up.
I will walk through how we do the "Popular Replies" section on the conversation page. Again, here is a @BarackObama tweet with more than 500 replies. Popular replies extracts only those replies which have been further replied to or retweeted, or which are from the author of the original tweet. Right now it has picked out 1 of these 500+ replies.
Data Model
Responses to a tweet are captured in a bucket of their own, <<"tweet_responses_bucket">>. Each tweet is keyed by its tweet id as a 128-bit binary, <<TweetId:128>>. Response details are not stored directly on this resource but on linked values in a bucket called <<"tweet_responses_subkeys_bucket">>. Responses are stored as links on a resource keyed as <<TweetId:128, (ResponseId rem 10):8>> in this bucket. That resource is in turn added as a link on the {<<"tweet_responses_bucket">>, <<TweetId:128>>} resource, tagged <<"tweet_response">>. A reply is recorded as a link of the form {{<<ResponseId:128>>, <<ResponseAuthorId:128>>}, <<"reply">>}. A link is represented as {{Bucket, Key}, Tag}; this link does not point to a valid bucket/key pair but is purely for our own interpretation.
Here is how it would look:
<<"tweet_responses_bucket">>
----------------------------
|----------------------------------------|
| <<20337776197:128>> |
|----------------------------------------|
| Links |
| |
| {{<<"tweet_responses_subkeys_bucket">>,|
| <<20337776197:128,0:8>>}, |
| <<"tweet_response">>}, |
| {{<<"tweet_responses_subkeys_bucket">>,|
| <<20337776197:128,1:8>>}, |
| <<"tweet_response">>}, |
| .... |
|----------------------------------------|
| Value |
| |
|----------------------------------------|
<<"tweet_responses_subkeys_bucket">>
------------------------------------
|----------------------------------------|
| <<20337776197:128,0:8>> |
|----------------------------------------|
| Links |
| |
|{{<<20339861590:128>>,<<18035803:128>>},|
| <<"reply">>}, |
| .... |
|----------------------------------------|
| Value |
| |
|----------------------------------------|
|----------------------------------------|
| <<20337776197:128,1:8>> |
|----------------------------------------|
| Links |
| |
|{{<<20337857101:128>>,<<82294968:128>>},|
| <<"reply">>}, |
| .... |
|----------------------------------------|
| Value |
| |
|----------------------------------------|
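For concreteness, here is how the keys and links above are built from the example ids in the diagram (a shell-style sketch of the encoding just described):

%% Example ids from the diagram.
TweetId = 20337776197,
ResponseId = 20339861590,
ResponseAuthorId = 18035803,

%% Subkey resource holding a slice of the replies: the tweet id plus one
%% byte spreading responses across 10 resources.
SubKey = <<TweetId:128, (ResponseId rem 10):8>>,

%% Link stored on the main tweet resource, pointing at the subkey resource.
SubKeyLink = {{<<"tweet_responses_subkeys_bucket">>, SubKey}, <<"tweet_response">>},

%% Link stored on the subkey resource; the {Bucket, Key} slot carries the
%% reply id and author id and does not point at a real object.
ReplyLink = {{<<ResponseId:128>>, <<ResponseAuthorId:128>>}, <<"reply">>}.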
Code
And now here is the piece of code that does the extraction of the popular replies. The function returns a sorted list of {TweetId, AuthorId} tuples, which are then looked up and served.
https://gist.github.com/510070
Hopefully the code is self-explanatory. Of interest is make_local_fun, which creates a function reference that can be passed to a remote node without the remote node having a copy of the compiled code in its path.
Feel free to comment on anything I have overlooked or could have done better :)
Riak at inagist.com
At inagist.com we have been using Riak, and yes, we are loving it. We moved away from Cassandra after it started taxing our limited resources. The nice thing about Cassandra was the data model: super columns allowed us to store metadata for a resource as needed. For example, the retweets and replies of a tweet were stored in their own super columns associated with the tweet, and we could pull them out as needed. Concurrency issues were also not a bother: we could do simultaneous updates to columns and super columns without worrying about data consistency. This is seriously tricky when maintaining tweet statistics, since popular tweets keep getting retweeted and replied to concurrently by many people.
When looking for alternatives, Riak was our first choice, primarily because it is in Erlang and because it had a map-reduce option which looked seriously promising. The ability to choose between backends was another compelling factor. Here is some of the interesting stuff we have worked out while using Riak.
Using the Data model to our advantage
At the heart of Riak, everything is a key-value pair. All metadata is associated with the value and has to be read and updated as a single unit. The most interesting metadata is, of course, the links you store along with a key's value. Interesting because Riak's map-reduce has an extra phase called link walking, which lets you filter the links on a document by tag or bucket and feed the linked documents to the next phase. In fact, Riak's map-reduce allows any combination of link, map, and reduce phases to process your data, and each is optional, so you can have link-reduce, link-link-reduce, or even link-link type queries.
Why is this interesting? It allows us to store metadata on a separate resource and link it to the main resource. Meaning we could have, say, 10 buckets storing the ids of the retweets as links, with the main resource holding a link to each of them. You can walk through this list with a link-link query. This reduces contention on any one resource for updates, parallelizes the read, and spreads it across the cluster. We store the reply and retweet details of a tweet in this model.
A link has three attributes: Bucket, Key, and Tag. It is supposed to refer to a bucket and key if you intend to fetch further data out of the linked document, but if you know what you are up to, this allows for some serious extra data management. We currently store some tweet metadata in the Bucket and Key slots with a well-known tag. When we later want to query for all replies to a tweet, we do a link-link-reduce on the well-known tags and get the replies or retweets out. I'm not getting into specifics, but it should give you the idea.
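To make the shape of such queries concrete, here is a hedged sketch of a link-link-type query through the native client, in the style of the map-reduce specs of that Riak era; the map function is made up for illustration:

%% Shell-style sketch; mr_sketch:extract_reply_links/3 would be a map
%% function that emits the reply links found on each linked resource.
TweetId = 20337776197,
{ok, Client} = riak:local_client(),
{ok, Results} = Client:mapred(
    [{<<"tweet_responses_bucket">>, <<TweetId:128>>}],
    [%% phase 1: follow links tagged "tweet_response" to the metadata resources
     {link, '_', <<"tweet_response">>, false},
     %% phase 2: map over those resources, emitting their "reply" links
     {map, {modfun, mr_sketch, extract_reply_links}, none, true}]).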
Interfacing with Riak
Most references point to using Riak via the HTTP interface or via the protocol buffers client, which is great if you are working from a non-Erlang environment.
We currently use the built-in native client for Riak rather than the protocol buffers client. With the main processing being in Erlang, and a distributed app at that, there was no point in going through extra layers to get into Riak. This also gives us some interesting options, for example the "Your Friends" tab on the conversations page that you see once you log into inagist.com. This does a link-link-reduce-reduce where the extra reduce talks to the remote Erlang process for the logged-in user to filter out only replies from his followers. See it in action on a popular tweet like this one from @BarackObama. Mind you, the "Your Friends" feature will work only after your account is enabled on inagist.com, but the Popular tab works for anyone and pulls out only the replies which are of interest.
Storage options
And yes, the backend: we currently run the innostore backend, based on Embedded InnoDB. This in a way makes Riak a distribution layer over a trusted storage layer. Of the backends available, this has worked best for us, giving consistent performance while being reasonable on resource usage. But the biggest factor is the option to plug in what you want, like the trial we did with Tokyo Cabinet.
Our biggest bottleneck now is disk space; we keep pruning the data set at a fairly fast pace, and roughly one week of data is all we hold. We get a little over 5 million tweets a day from the Twitter pipe and keep cleaning out as the disks fill up.
A big thank you to the guys at Basho for Riak; it's seriously awesome.
How we rate content at inagist.com
We are often asked, "How do you decide scores for the tweets you show on inagist.com?" Well, here is how we determine what to show and what not to show.
We divide tweets coming in from the Twitter stream into 2 categories: ones which have a URL and ones which do not. The assumption is that tweets with a URL tend to be teasers for the content at the linked URL. Tweets from different users mentioning the same URL (possibly through a URL shortener) are scored against the URL.
Another basic assumption is how we classify your follow community. The follow community is composed of the people you follow directly and the people you follow indirectly through lists. If you follow a person directly, we give a higher score to activity from that user, as opposed to activity from a user in one of the lists you follow. The reasoning is that the people you directly follow are what you see on most Twitter clients by default, and possibly the people you care about more.
Once your Twitter id is enabled on inagist.com, we fetch your follow community and start watching the Twitter stream on your behalf for activity from, or about, the people in this community. A retweet of, or a reply to, a tweet counts as activity on it. A retweet of a person you directly follow by a person in one of your lists gains a higher score than a retweet of that person by someone not in your community, and a retweet of that person by a person you directly follow gains an even higher score. Similar scoring applies to replies, except that a reply from a person not in your community to a person in your community does not count. Conversations between people you directly follow gain higher scores than conversations among people in your lists. A toy sketch of the ordering follows.
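Something like this, though the numbers below are made up purely to illustrate the relative ordering described above, not our actual weights:

-module(score_sketch).
-export([activity_score/2]).

%% activity_score(ActivityType, WhoActed) -> relative weight
activity_score(retweet, direct_follow) -> 3;
activity_score(retweet, list_follow)   -> 2;
activity_score(retweet, outside)       -> 1;   %% still counts, but lowest
activity_score(reply,   direct_follow) -> 3;
activity_score(reply,   list_follow)   -> 2;
activity_score(reply,   outside)       -> 0.   %% replies from outside don't count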
We present tweets which have a high score, sorted by time, on your user page. And yes, options are coming soon to sort and filter by score.
We do all of this in real time for each enabled user on inagist. To keep the cost behind all of this reasonable, we use bounded caches for scoring tweets, so if you have more people in your follow community you will see tweets expiring faster from your main page. We do archive the tweets we discover, so you can go back and see the tweets we picked out in the past; clicking on a list name pulls in tweets from the archive for that particular list.
The conversations page gives you more context on the activity surrounding these filtered tweets. Clicking on the activity icon takes you to the conversation page, where all the replies to a tweet are shown in order. We also present two different views on these replies: popular replies are ones which have been further replied to or retweeted, while replies from your friends brings out only replies from your community.
We encourage you to sign in and see this in action for yourself. Also get onto Twitter and start following lists of your choice on topics that interest you, or create lists of your own and curate your sources. May we recommend verified, tlists, and scobleizer for a variety of lists to get you started. Continue using Twitter with your favorite client, and inagist.com to surface interesting tweets from all your sources; we assure you, you will not miss a single one of those interesting 140 characters.
Introducing inagist.com
How often has Twitter been your source of breaking news? Think the Mangalore plane crash or gulf oil spill updates. How many times in the recent past did you learn about those tidbits from your Twitter stream? Think the Google Pacman doodle or the iPhone 4G leaks. Now think how often you have felt that an important tweet went unnoticed because of the volume of tweets coming at you, or how often you have refrained from following people because you cannot keep up with your existing tweets.
If you are not so much into Twitter but have often heard about it, have you wondered if you could get the information that matters to you out of Twitter? Could Twitter be your portal to the web?
We present inagist.com, a real-time tweet filtering system which sifts through the thousands of tweets flowing into your account and shows you just the ones which are interesting, or thought to be of value by other people. We rank tweets based on the perceived relevance of their content, and then show you the relevant tweets sorted either by time or by relevance. You can then do the normal activities on a tweet: reply, retweet, favorite, etc. Additionally, we pull in URL summaries and media previews in place, to give a complete reading experience. Tweets are pulled in from the people and lists you follow, so go ahead: either follow that person or add him/her to a list of yours, or follow those lists that @scobleizer has painstakingly put together. We advise you to directly follow the people that matter to you, so your Twitter page looks clean and interesting; your inagist page will sort and rank tweets from all those numerous others. Once you authorize your Twitter account with InAGist, we regularly sync your account information to keep abreast of your follow changes. Go ahead and try it out at http://inagist.com/<your twitter handle>; we are at jebui, chjeeves, netroy.
For those unaware of Twitter, we present inagist channel pages. These are curated Twitter accounts on specific topics: world news is at worldnews.inagist.com, India news at india.inagist.com, geeks at geek.inagist.com. We have created Twitter accounts which follow sources on these topics, so you can come in and get your taste of tweets. Tweets from these accounts are ranked as explained before, so you see only the tweets from these sources which are relevant. All the channels are listed at twitter.com/InAGist/channels/members.
We are still in the early stages of development and not open to the whole Twitter crowd. Do check if your account is already enabled at http://inagist.com/<your twitter handle>; if not, please authorize the inagist app and we will enable accounts as we go along.
Your feedback is important to us; feel free to get back with comments, feature requests, bouquets, or brickbats. @inagist is the preferred channel. InAGist is a product offering from Iyotta Software.