• Winter 2014-15

    Volume 37, Number 2

    Virginia Tech Magazine, winter 2014-15

  • From predicting events in Latin America and the Middle East to modeling the response to a nuclear detonation in the nation's capital, Virginia Tech leads the way in the big data movement.

  • EMBERS

    A system developed by the Discovery Analytics Center that provides a continual, automated analysis of open-source data to forecast significant societal events.

    NDSSL

    For a disaster resilience study, the Virginia Bioinformatics Institute's Network Dynamics and Simulation Science Laboratory created a simulated environment using big data methods to evaluate disaster preparedness policies and interventions.

  • EMBERS forecasting

    1)  The Discovery Analytics Center collects open-source information from satellite images, Facebook, Twitter, news, economic indicators, Google search volume, and more.

    2)  Data flows into the EMBERS system, which processes between 200 and 2,000 messages per second. A variety of data filters and distinct models are trained to identify different patterns.

    3)  All the models' outputs are fused in a final model that forecasts an event. An EMBERS dashboard displays elements related to a single warning, including a warning generation pipeline, geolocated warning hotspots, news content, warning location, and original article.

    4)  Alerts of significant events are emailed in real-time as they are generated. EMBERS sends 40-50 alerts per day to its clients.

  • The EMBERS Project Can Predict the Future With Twitter

    by Leah McGrath Goodman
    Newsweek, 3/7/15

    "[Virginia Tech] offers a glimpse into just how much 'big data' has changed the game by magnifying the U.S. intelligence community's ability to forecast—with phenomenal accuracy—human behavior on a global scale by scouring Twitter, YouTube, Wikipedia, Tumblr, Tor, Facebook and more."

  • Selected big data efforts at Virginia Tech

    • The Bradley Department of Electrical and Computer Engineering at Virginia Tech compiles radio astronomy data to advance knowledge of cosmology, pulsars, and other heavenly phenomena.

    • The Virginia Tech Transportation Institute collects and analyzes massive amounts of video and sensory data from cars, trucks, and motorcycles as part of its naturalistic driving studies.

    • The Pamplin College of Business has launched the Center for Business Intelligence and Analytics, an interdisciplinary resource that encourages big data research, study, and applications in the business world. The center's goals also include developing an interdepartmental minor in business intelligence and analytics. With specialty areas in social media analytics, text analytics, health analytics, and more, the center is working with the Virginia governor's office and many large corporations on big data and business analytics projects. In one successful project focused on quality control, researchers mined online data to detect product defects.

    • Graduate students and faculty from the College of Liberal Arts and Human Sciences, the Discovery Analytics Center, and University Libraries collaborated with the University of Toronto to mine data from more than 100 different newspapers chronicling the 1918 influenza pandemic. The project, which sought to understand how newspapers shaped public opinion and represented authoritative knowledge during the deadly pandemic, was one of 14 projects approved for funding in the National Endowment for the Humanities and the Social Sciences and Humanities Research Council of Canada's Digging into Data Challenge.

    • Researchers in the Virginia Bioinformatics Institute and the Center for Peace Studies and Violence Prevention used disease transmission models to study criminal incarceration, examining how incarceration can be transmitted to the family and friends of those who are incarcerated. Synthesizing publicly available data from a variety of sources, the researchers generated a realistic, multigenerational synthetic population with contact networks, sentence lengths, and transmission probabilities.

    • Part of information technology, Advanced Research Computing (ARC) supports cutting-edge computing resources, including the Blue Ridge and HokieSpeed supercomputers, that serve researchers across the university. Virginia Tech's investment in high-end computational architectures is paying off for researchers by processing data generated from projects ranging from geospatial image data to genomic assembly.

    • In the College of Natural Resources and Environment, the geography department is using massive amounts of radar data from the National Weather Service and the National Climatic Data Center archives to create a 3-D immersive tornado in the Moss Arts Center's Cube. Researchers hope to unlock the power of big data to improve the understanding of the underlying physics of atmospheric phenomena and provide instruction in the area of atmospheric dynamics. In addition, the college's new Center for Natural Resources Assessment and Decision Support is using big data sets to model the future sustainability of Virginia's resources, beginning with forestry resources.

    • For his efforts in big data and cloud computing, Professor Wu Feng—featured in a Q&A in this edition—is also being featured in one of Microsoft's global advertising campaigns. One of the ads credits Virginia Tech scientists and engineers with harnessing "supercomputer power to analyze vast amounts of DNA sequencing information and help deliver lifesaving treatments" in the fight against cancer.

  • Virginia Tech offers new data-centric majors

    As the world becomes more data-driven, Virginia Tech is incorporating aspects of big data into classes and academic programs across campus. Additionally, Tech is offering two new, interdisciplinary undergraduate degrees based largely around big data: environmental informatics in the College of Natural Resources and Environment and computational modeling and data analytics in the College of Science.

    The environmental informatics major incorporates information technology, data analysis, natural resources, geospatial science, and ecological modeling to enable students to explore and apply information science to the sustainable management of the natural world.

    Students develop skills in remote sensing, ecosystem management, spatial data analysis, statistics, Web and database management, and sustainability analytics that can be used in many environmental professions and applications, ranging from forestry and landscape mapping to pollution modeling and watershed ecology.

    The College of Science's computational modeling and data analytics major draws together mathematical modeling, modern data science, and high performance computing. The degree is targeted at students from a variety of disciplines, especially those with a deep curiosity for understanding how the world works by developing computer simulations and mathematical models.

    In addition to algorithm design and modeling, the major will also address important ethical considerations, ranging from data collection to the responsibility of a scientist to present clear and unambiguous explanations to those responsible for making public policy.

    Fortune-telling and other uses of big data at Virginia Tech

    by Madeleine Gordon and Mason Adams

    A group of scientists at the Virginia Bioinformatics Institute is using network dynamics to analyze the patterns of people's movements to help decision-making in the face of a natural disaster or epidemic outbreak.

    So long, crystal balls.

    Through the use of big data, Naren Ramakrishnan and his team from the computer science department's Discovery Analytics Center (DAC) may make forecasting the future as commonplace as forecasting the weather.

    The term "big data" refers to the use of algorithms and other tools to train computers to spot trends in collections of information that are too massive and complex to analyze with traditional methods. The proliferation of data has accelerated with the integration of computers into our daily lives, from social media use on our phones to the tracking of buying habits at the grocery store.

    Virginia Tech's efforts stand at the forefront of the big data movement, with labs and professors across the commonwealth conducting increasingly data-driven research as the university looks to build additional capacity for future initiatives. Maintaining a strong presence in Blacksburg as well as in the National Capital Region allows for significant collaborations in the domains of intelligence analysis, national security, and health informatics.

    "To Virginia Tech's researchers, big data represents an important opportunity to create knowledge and provide insight by leveraging large, potentially unstructured data sets," said Scott Midkiff, the university's vice president for information technology and chief information officer and a professor in the Bradley Department of Electrical and Computer Engineering.

    Projects like DAC's EMBERS and the Virginia Bioinformatics Institute's (VBI) Network Dynamics and Simulation Science Laboratory (NDSSL), which simulates disasters to evaluate emergency response and disaster preparedness policies, are telling examples of big data's potential.

    Forecasting the future

    EMBERS, the acronym for "early model-based event recognition using surrogates," provides a continual, automated analysis of open-source data—everything from Facebook posts and website searches to satellite images and restaurant reservations made online—to forecast significant societal events such as disease outbreaks, domestic crises, and elections in countries around the globe.

    Once a trend or pattern is recognized, EMBERS applies thresholds learned by the algorithms that process past data and events. If the threshold is met, an alert is sent to a third party for evaluation. Training the computers to recognize trends is not very different from teaching an email system to recognize spam, said Ramakrishnan, the Thomas L. Phillips Professor of Engineering and DAC's director.

    "The science of big data is about designing algorithms that can transform raw data into actionable knowledge or intelligence," Ramakrishnan said. "There isn't one specific, magic algorithm or threshold in EMBERS. There are a variety of data filters and distinct models trained to identify different patterns. All these models' outputs are then fused into the final model that forecasts the event and produces the alert."

    EMBERS now sends 40 to 50 alerts per day to its clients.
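The fusion-and-threshold idea Ramakrishnan describes can be sketched in a few lines of Python. This is only an illustration: the article does not say how EMBERS combines its models, so the weighted average, the model names, the weights, and the threshold below are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class ModelOutput:
    """Score from one hypothetical upstream model, in the range [0, 1]."""
    name: str
    score: float
    weight: float


def fuse(outputs: list[ModelOutput]) -> float:
    """Combine per-model scores with a weighted average (an assumed fusion rule)."""
    total_weight = sum(o.weight for o in outputs)
    if total_weight == 0:
        raise ValueError("at least one model needs a non-zero weight")
    return sum(o.score * o.weight for o in outputs) / total_weight


def should_alert(outputs: list[ModelOutput], threshold: float) -> bool:
    """Send an alert only when the fused score meets the threshold."""
    return fuse(outputs) >= threshold


# Made-up scores from three made-up models.
outputs = [
    ModelOutput("keyword_volume", score=0.9, weight=2.0),
    ModelOutput("news_sentiment", score=0.6, weight=1.0),
    ModelOutput("economic_indicators", score=0.3, weight=1.0),
]
fused = fuse(outputs)
print(round(fused, 3), should_alert(outputs, threshold=0.65))
```

In a real system the threshold would be learned from past data and events, as the article notes, rather than fixed by hand.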

    EMBERS: Student protests in Venezuela

    EMBERS successfully forecast student-led protests in Venezuela that began after the attempted rape of a student but morphed into broader protests against police brutality and other issues. EMBERS also forecast that the protests would turn violent and spread to multiple cities.

    Spread of protests in Venezuela, January and February 2014

    "In EMBERS, when we say forecasting, we really are forecasting," Ramakrishnan said. "A lot of projects have the benefit of hindsight, and [people] look back and say, 'Oh, we could have predicted that,' but we send forecasts before the event happens."

    Unlike a spam filter that handles a few hundred emails, though, EMBERS has sorted through more than 21 terabytes of data since its inception, while looking at only a small portion of the world. For perspective, 1 terabyte of data could store 1,000 to 5,000 movies.

    EMBERS processes between 200 and 2,000 messages per second, each one a tweet, news item, blog post, or stock value. With such a breadth of information, there are bound to be inaccuracies, such as rumors, spam, or news stories that are later retracted. However, EMBERS' algorithms are designed to weed out misinformation, Ramakrishnan said.
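To put those figures in perspective, here is a quick back-of-the-envelope calculation in Python. It assumes the quoted message rates are sustained around the clock, which the article does not claim, and it uses the article's own estimate of 1,000 to 5,000 movies per terabyte.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Messages per second, from the article; sustained all day is an assumption.
low_rate, high_rate = 200, 2_000
low_daily = low_rate * SECONDS_PER_DAY
high_daily = high_rate * SECONDS_PER_DAY

# Data sorted since inception, expressed in the article's movie equivalents.
terabytes = 21
movies_low, movies_high = terabytes * 1_000, terabytes * 5_000

print(f"{low_daily:,} to {high_daily:,} messages per day")
print(f"{movies_low:,} to {movies_high:,} movies")
```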

    Not surprisingly, EMBERS is getting attention from the federal government; the project is funded by the Intelligence Advanced Research Projects Activity (IARPA), which is part of the Office of the Director of National Intelligence. DAC was one of three teams chosen to compete in IARPA's Open Source Indicators (OSI) program. Starting in April 2012, DAC's team vied for full funding from IARPA, alongside industry competitors Massachusetts-based Raytheon BBN Technologies and California-based HRL Laboratories.

    For two years, the three teams focused their forecasts on about 20 countries in Latin America. EMBERS accurately forecast several events there, including riots following the impeachment of Paraguay's president in 2012, hantavirus outbreaks in Chile and Argentina in 2013, and elections in Panama and Colombia in 2014.

    IARPA monitored the three teams' progress while an independent government contractor assessed the quality of forecasts. Each month, EMBERS and the other teams would receive a scorecard evaluating their forecasts based on five criteria: lead time, mean probability score, quality score, recall, and precision.

    EMBERS scored at or above target in most of the categories, forecasting events with a mean lead time of 7.54 days. Of the three teams awarded an initial contract, DAC was the only team to secure a contract for the third and final year of funding. (DAC expects to secure funding to continue its forecasting work.)
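Three of the five scorecard criteria have standard textbook definitions, sketched below in Python. The article does not give IARPA's exact scoring rules, so these formulas are assumptions and the dates and counts are invented for illustration. The mean probability score and quality score are left out because their definitions do not appear in the article.

```python
from datetime import date
from statistics import mean

# Hypothetical matched forecasts: (date the alert was issued, date the event occurred).
matched = [
    (date(2014, 2, 1), date(2014, 2, 9)),
    (date(2014, 2, 3), date(2014, 2, 10)),
    (date(2014, 2, 5), date(2014, 2, 11)),
]
alerts_issued = 4      # assumed total alerts, including one false alarm
events_occurred = 5    # assumed total real events, including two that were missed

precision = len(matched) / alerts_issued    # share of alerts that matched a real event
recall = len(matched) / events_occurred     # share of real events that were forecast
lead_time_days = mean((event - issued).days for issued, event in matched)

print(precision, recall, lead_time_days)
```

A low false-alarm rate corresponds to high precision, while catching most events corresponds to high recall. A forecaster generally has to trade one against the other.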

    Jason Matheny, OSI program manager at IARPA, said DAC's team has "been able to accurately forecast hundreds of societal events, days to weeks before they occur, with a low false-alarm rate."

    DAC has widened its focus from Latin America to the Middle East and North Africa. Since June 2014, EMBERS has been sifting through information gathered from seven countries in the region: Bahrain, Egypt, Iraq, Jordan, Libya, Saudi Arabia, and Syria.

    Because of the geographic change, Ramakrishnan and his team have had to adapt several models to the new region. DAC now has a Middle East expert on its team to help understand the complex linguistics, which vary between dialects and between written and spoken word, and the myriad cultural differences from country to country.

    "In the Middle East, expression of discontent happens differently than in Latin America. You have to have a much better local understanding of how people voice concerns and how they communicate [their discontent], for instance," Ramakrishnan said.

    While forecasting the future may sound fanciful, it holds a number of practical applications.

    "Forecasting civil unrest is useful for people and groups as they make travel plans," Ramakrishnan said. "It also helps governments understand what people are frustrated about, know what the hot-ticket items are, and [decide] what they can do about it. It helps them understand what the citizens' priorities are. What are the most important grievances?"

    Simulating disasters

    Big data initiatives are leading the way to predicting the future—and they are being used to determine how to deal with that future.

    VBI's NDSSL created a simulation environment using big data methods to evaluate disaster preparedness policies and interventions.

    Madhav Marathe, a VBI professor and NDSSL director; Christopher Barrett, a professor and VBI's executive director; and Stephen Eubank, a professor and NDSSL deputy director, led a large team that modeled human behavior using a combination of many data sources to simulate a nuclear detonation in Washington, D.C., depict the behavior of more than 730,000 simulated D.C. residents, and evaluate the emergency response.

    NDSSL disaster resilience study

    A simulated nuclear blast in Washington, D.C.

    The light gray gradient indicates the radiation dosage from fallout. The bars indicate aggregate counts of individuals in different health states at the various locations.

    1)  NDSSL collected open-source information (census and infrastructure data, etc.) to create more than 730,000 synthetic individuals in a simulated infrastructure.

    2)  The model tracks behavior and how individuals interact with infrastructure. For instance, the availability of power affects the ability to communicate, the route traveled exposes a person to radiation and to risk of injury, and health state determines a person's likely behaviors.

    3)  Decision-makers in public safety and other areas can use simulations to improve disaster resilience by taking proactive measures.
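The interactions listed in the steps above can be illustrated with a toy agent model in Python. This is not NDSSL's simulation: the fields, the dose arithmetic, the health cutoff, and the action names are all hypothetical, chosen only to show how power, route, and health state might feed into a synthetic individual's behavior.

```python
from dataclasses import dataclass


@dataclass
class Person:
    """One synthetic individual; every field is an illustrative assumption."""
    health: float        # 1.0 = healthy, 0.0 = incapacitated
    has_power: bool      # whether the local power network is up
    dose: float = 0.0    # accumulated radiation dose, arbitrary units


def can_communicate(p: Person) -> bool:
    """Availability of power affects the ability to communicate."""
    return p.has_power


def travel(p: Person, route_dose_rates: list[float]) -> None:
    """Each route segment adds to the dose and lowers health (assumed linear)."""
    for rate in route_dose_rates:
        p.dose += rate
        p.health = max(0.0, p.health - 0.1 * rate)


def choose_action(p: Person) -> str:
    """Health state determines the likely behavior."""
    if p.health < 0.5:
        return "seek_care"
    if can_communicate(p):
        return "call_family"
    return "search_for_family"


person = Person(health=1.0, has_power=False)
print(choose_action(person))              # healthy but without power
travel(person, [1.0, 2.0, 3.0])           # a route through fallout
print(person.dose, choose_action(person))
```

The real study ran rules of this general kind across more than 730,000 synthetic individuals, with behavior grounded in survey and infrastructure data rather than hand-picked constants.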

    Using massive amounts of data, including the American Community Survey, tourism reports, transportation routes, cell-tower communication data, hospital registries, power-network data, and surveys of human behavior in disasters, the team generated synthetic individuals to gauge their likely motivations and reactions in the midst of the disaster.

    "The event … allows us to collect information from varied sources and build a synthetic, but realistic representation of the event, as well as what I would call a physical world, the infrastructure world, and a social world," Marathe said. "All three worlds have to come together and be represented meaningfully to do the analysis because otherwise you're missing one of the three things."

    Encompassing a 48-hour span in the midst of a nuclear disaster, the simulation produced several terabytes of data, the result of the unimaginably complex algorithms and computer modeling the team had created. Millions of simulated individuals were incorporated into a single, mineable dataset based on real-world information, Barrett said.

    The team found that even a small increase in the ability to provide functional communication systems would allow people to do a substantially better job coordinating activities such as finding family members. Because people's first instinct in the wake of a disaster is to use their phones, communication systems tend to falter under the volume of texts and calls. Such findings allow the lab to provide decision-makers with better information.

    "This is a really important finding, and this could not have been done in this particular form had we not put all the data together, filled in, made a consistent representation, taken the things forward, and then mined for nuggets within this," Marathe said.

    Said Barrett, "Even though human behavior is a black box in a black box in a black box, we still can come very close to getting very rational, reasonably stated ways that you would expect people to move."

    With the rapid pace of technological advances, information from big data simulations can be generated more quickly than ever. Marathe said the time it takes to run a simulation has decreased from a couple of days to mere minutes.

    In addition to improved technology, Eubank attributes the growth of big data to the changes in the way society collects information.

    "We had no idea that 20 years from when we started a transportation project that it would be commonplace for people to report their location on a minute-by-minute basis to the world," Eubank said.

    Living in a data-driven world

    Scientists and researchers working with big data foresee even more innovation on the horizon.

    In fact, those like Ramakrishnan, Marathe, Barrett, and Eubank—who have made a habit of dealing with the future—see the future of big data happening at Tech.

    "I think that Virginia Tech has provided us with an environment and ecosystem to carry out this research over the [past] 10 years which has been very, very conducive to do this and I certainly value this. Tech has been very supportive of our work," Marathe said. "It is very cool to have an institute that allows us to do things in a very novel and aggressive way."

    Barrett sees their big data research as world-leading, explaining that Virginia Tech's approach to computationally enabled social science and the development of a synthetic information platform are conceptually different from anything else in the field.

    Ramakrishnan also echoes the sentiment that Tech is at the forefront of big data research. "By creating DAC, we have brought together an interdisciplinary group of researchers from computer science, statistics, electrical and computer engineering, and mathematics. We have initiated graduate and undergraduate courses in this topic and hope to be a one-stop shop for the university and beyond in leading research and educational efforts in big data. The IARPA EMBERS project is an example of how DAC has led an interdisciplinary effort in this space, and we have just begun," he said.

    As Virginia Tech's researchers continue to develop new uses for big data, the university has upgraded its computer systems to keep pace and ensure the capacity to house the collected information. Midkiff, the university's vice president for information technology, sees collections of big data as a chance to re-evaluate Virginia Tech's missions and operations.

    The investment in big data initiatives in Blacksburg and in the National Capital Region allows for greater connections with industry partners while also making use of data to better serve society. "By improving the lives of people who actually produce social data, big data is more than just a passing trend," said Christopher Walker, DAC program manager.

    The wave of research also is moving into classrooms as the university presents students with more opportunities to innovate. Many degree programs, including computer science, electrical engineering, and statistics, already include big data elements, while two interdisciplinary undergraduate degrees have been introduced (see the sidebar "Virginia Tech offers new data-centric majors").

    "Virginia Tech is working to ensure that all of our graduates are prepared to thrive in a society that is data-driven and networked," Midkiff said.

    No matter what the future holds, big data research has found a home at Tech.

    Madeleine Gordon, a senior English and communication major, was an intern with Virginia Tech Magazine. Emily K. Alberts, formerly the Discovery Analytics Center's public relations and marketing specialist and now the Department of Engineering Education's office manager, contributed to this article.

    Virginia Tech Magazine

    205 Media Building
    Virginia Tech
    Blacksburg, VA 24061

    Produced by University Relations
    © 2015 Virginia Polytechnic Institute and State University
