Topic 1: Live Social Observatory

One key activity of NExT is to gather and monitor the UGC sources related to a city. To this end, we have carried out large-scale, real-time crawling, indexing, and analysis of UGC sources covering content (both text and images), location, topics, and local mobile apps. Figure 1.1 lists the UGC sources that we crawled and their sizes as of June 2013; together they contain over 1.9 billion records. On top of these, we developed a range of first-order and higher-order analytics to help users make sense of the UGC. Figure 1.2 presents the interface to our Live Observatory System, which offers access to five sub-systems: the “live crawler”, and analytics on “location”, “people”, “topics” and “organizations”. An example of a first-order analytic is presented in Figure 1.3, which shows the keyword clouds for the search term “Instagram” for tweets before and after it was acquired by Facebook on 10 April 2012. This example shows how live analysis can track rapidly evolving circumstances and surface emerging events.
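As a rough illustration of a first-order analytic such as the keyword cloud, the sketch below counts term frequencies over two samples of posts that mention a search term. It is a minimal sketch, not the Live Observatory implementation: the function name `keyword_weights`, the small stop-word list, and the sample posts are all assumptions made up for this example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "the", "by", "for", "to", "of", "on", "in",
              "is", "and", "my", "this", "with"}


def keyword_weights(posts, search_term, top_n=5):
    """Return the top_n (word, count) pairs over all posts.

    Stop words and the search term itself are excluded, since the
    search term appears in every sampled post by construction.
    """
    term = search_term.lower()
    counts = Counter()
    for post in posts:
        for token in re.findall(r"[a-z0-9']+", post.lower()):
            if token not in STOP_WORDS and token != term:
                counts[token] += 1
    return counts.most_common(top_n)


# Made-up sample posts, for illustration only (not data from the project).
before = [
    "New photo filter on Instagram",
    "Instagram photo of my lunch",
    "Love this Instagram filter",
]
after = [
    "Facebook buys Instagram",
    "Instagram acquired by Facebook",
    "Facebook deal for Instagram photo app",
]

print(keyword_weights(before, "Instagram", top_n=2))
print(keyword_weights(after, "Instagram", top_n=1))
```

The resulting counts could serve as font weights in a cloud; comparing the two outputs shows how the dominant vocabulary around a term shifts between two sampling windows.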

The “Live Observatory System” can be accessed via: “http://www.nextcenter.org”. More details of our system are summarized in the paper [1].

Figure 1.1: The UGC sources monitored and their sizes (as of June 2013). The total number of UGC records collected is approximately 1.9 billion.

Figure 1.2: Interface to NExT Live Observatory System

Figure 1.3: Keyword clouds: by sampling tweets that referenced Instagram, we generated (a) the first cloud on 9 April 2012 and (b) the second on 10 April 2012, before and after the company was acquired by Facebook, respectively.

We currently carry out research in the following three areas.

First is the development of an intelligent crawler. One key problem in social media research is gathering a sufficiently large number of representative posts related to a given “topic”. This is particularly challenging when the “topic” is hot and crawling limitations restrict the number of posts that can be collected. To address this problem, we devise a strategy to crawl representative posts by leveraging: (a) multiple keywords; (b) key users; and (c) known accounts related to the “topic”. We have formulated sub-optimal solutions for identifying relevant and related keywords and key users, and we are integrating them into intelligent and robust crawlers.
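One simple way to picture the keyword and key-user identification step is a co-occurrence heuristic over an initial sample of posts, sketched below. This is an assumed stand-in, not the solution formulated by the project: the function `expand_topic`, its parameters, and the sample posts are hypothetical. Known accounts, item (c), would be supplied by hand and are not derived here.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case a post and return its set of distinct tokens."""
    return set(re.findall(r"[a-z0-9#@']+", text.lower()))


def expand_topic(posts, seed_keywords, n_keywords=3, n_users=2):
    """Suggest related keywords and key users for a topic.

    posts: list of (user, text) pairs from an initial sample crawl.
    A post is treated as on-topic if it contains any seed keyword.
    Related keywords are the terms found in the most on-topic posts;
    key users are the authors with the most on-topic posts.
    """
    seeds = {k.lower() for k in seed_keywords}
    term_counts = Counter()
    user_counts = Counter()
    for user, text in posts:
        tokens = tokenize(text)
        if tokens & seeds:
            term_counts.update(tokens - seeds)
            user_counts[user] += 1
    keywords = [t for t, _ in term_counts.most_common(n_keywords)]
    users = [u for u, _ in user_counts.most_common(n_users)]
    return keywords, users


# Made-up sample posts, for illustration only.
sample = [
    ("alice", "haze in singapore today psi rising"),
    ("alice", "haze psi hits record"),
    ("bob", "haze psi update"),
    ("carol", "lunch at the hawker centre"),
    ("bob", "football tonight"),
]

print(expand_topic(sample, ["haze"], n_keywords=1, n_users=1))
```

Under a crawl limit, the suggested keywords and users could then be used to widen the set of queries while keeping each query focused on the topic. A production crawler would also need to filter common words and weigh terms against their background frequency, which this sketch omits.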

Second, we are refining our architecture for storing the huge volume of UGC posts by adopting a distributed architecture that uses Hadoop, HBase, and an Elasticsearch index. For videos and images, we are refining our media search engine to index the more than 200 million images that we have crawled.

Third, we are working towards a framework for sharing our UGC resources. As sharing the raw crawled UGC data is neither meaningful nor legal, we are working with the Web Science Trust to develop a framework for sharing the higher-level analytics derived from these UGC posts. In the meantime, we are preparing several large-scale test datasets with ground truth for the research community, in areas including product and brand tracking, location semantics, event detection, and mobile videos of public events.

Representative Publications:

[1] Tat-Seng Chua, Huanbo Luan, Maosong Sun, Shiqiang Yang: NExT: NUS-Tsinghua Center for Extreme Search of User-Generated Content. IEEE MultiMedia, 19(3): 81-87 (2012)

[2] Huanbo Luan, Dejun Hou, Tat-Seng Chua: NExT-Live: A Live Observatory on Social Media. MMM 2013: 514-516