11:00 am
Case Study Panel Discussion

Data-Driven Science and Cyberinfrastructure
Live Q&A session will immediately follow

Scientific instruments that transform physical phenomena into digital data, together with computer simulations of scientific experiments, have created a data-driven revolution in the sciences. Scientists have moved from data-starved environments to conditions where research communities are drowning in data. An ongoing census of all pulsars in the Milky Way at the Arecibo Observatory will create about 1 petabyte of data over the next four years. The Cornell Computational Agriculture Initiative is developing high-resolution climate data on a daily basis. Support for these projects will require a comprehensive cyberinfrastructure.

Panelists will discuss how high-performance computing is essential to their research, drawing on specific examples from data-driven projects such as the international pulsar project (the PALFA Consortium), which is based on data collected at the Arecibo Observatory, and the Cornell Computational Agriculture Initiative. They will also discuss how the Cornell Theory Center is developing, implementing and maintaining a cyberinfrastructure to support this research and the large data sets it generates.

Discussion Topics/Panelists:
The International Pulsar Project
James Cordes, Ph.D.
Professor of Astronomy
Cornell University




The Arecibo Observatory, the world's largest and most sensitive single-dish radio/radar telescope, is managed by the National Astronomy and Ionosphere Center (NAIC) at Cornell. Arecibo provides state-of-the-art observing facilities for scientists in radio astronomy, solar system radar astronomy and atmospheric studies. The volume of information being gathered in astronomy today is estimated to be doubling roughly every 1.5 years. This huge growth in data volume is accompanied by a great increase in data complexity. Cornell astronomers, along with a consortium of national and international researchers, use the Arecibo telescope to conduct data-intensive surveys. These surveys will produce thousands of terabytes of data. Arecibo data and refined data products will be a unique resource for years to come, providing synergistic opportunities with other large-scale surveys that have been done and with telescopes of the future, including the Gamma-ray Large Area Space Telescope, to be launched in 2007. Access to astronomical data at the Cornell Theory Center will be provided in accordance with virtual observatory methods that are now being developed.

James Cordes's research interests include radio astronomy, neutron stars, pulsars, the interstellar medium, the search for extraterrestrial intelligence, signal processing techniques, statistical inference, and topics in computer science. He regularly makes observations using radio telescopes in Arecibo, Puerto Rico, the Very Large Array in New Mexico, the Parkes telescope in Australia, and the Very Long Baseline Array, headquartered in New Mexico. Cordes also makes infrared and optical observations using the Hale Telescope at Palomar and has taken part in joint radio and gamma-ray observations using the Compton Gamma-ray Observatory and X-ray Timing Explorer. He also uses the Hubble Space Telescope and the Chandra X-ray Satellite in his multiwavelength work.

He is currently planning deep searches for radio pulsars using the upgraded Arecibo Observatory and a new multiple-feed receiver system. Cordes is also heavily involved in the Square Kilometer Array project, a next-generation radio telescope.


Computational Agriculture
Harold Mathijs Van Es, Ph.D.
Professor, Crop and Soil Sciences
Cornell University




This program involves a collaborative effort between the Cornell Theory Center (CTC) and the College of Agriculture and Life Sciences (CALS). CTC offers support and training in the use of HPC technologies, encourages collaboration as a group to move into a new era in agricultural science research, and helps bring the results of research quickly to farmers and the general public. In this setting, HPC necessarily goes beyond fast data processing to include effective methods for data warehousing and querying, and a next-generation user interface that allows effective access to HPC facilities from remote locations. Notably, this initiative takes advantage of CTC's infrastructure for Microsoft SQL applications, which allows rapid access to large databases and integration with dynamic simulation modeling efforts. Recently developed methods for producing high-resolution climate data now require terabytes of data storage, which must be available for rapid access by dynamic models and for continuous data mining. The dynamic models, in turn, require HPC facilities to process extensive multi-year simulations for probabilistic assessments of agricultural-environmental processes. In addition, several component projects will take advantage of the integrated efforts in GIS-based Web interfaces and data representation.

Harold van Es, Professor of Soil and Water Management, joined Cornell in 1988. He has extension, research and teaching responsibilities related to the management of soil and water resources. He serves as Director of Graduate Studies and is the lead PI for the Cornell Initiative on Computational Agriculture. He is a Fellow of the Soil Science Society of America and also served as chair of Division S-6 (Soil and Water Management). He was an advisory member to the New York State Soil and Water Conservation Committee, and member of EPA Science Review Boards for FQPA and FIFRA implementation. He has authored over 60 refereed journal articles, co-authored the best-selling book Building Soils for Better Crops, and served as major advisor for 18 graduate students.

Microbial Ecology
John Bunge, Ph.D.
Associate Professor, Department of Statistical Science
Cornell University




Technology and parallel processing have the potential to play significant roles in understanding our environment. The Earth’s biodiversity is immense: science is aware of several million living species, but there are many more that have not yet been identified. We are developing innovative methods to help solve a tricky problem: how many classes or species should we expect to find in a population, given incomplete information? Because it’s not practical to count every organism, researchers sample the population in a certain place and time. By chance, however, members of some species can be missed altogether. To compensate for this, statisticians tabulate and plot the number of times each species was seen and construct a curve that can be extrapolated back to estimate how many species were observed zero times, that is, the ones completely missed by sampling. In most cases, we are working with populations that contain hundreds or thousands of different species. Some of them appear in the sample many times. Others, though, appear very infrequently. When researchers find that many species are seen only once, or very infrequently, it follows that there are other species the researchers didn’t see at all. Parallel processing speeds up this kind of computation by having many computer processors work on different parts of a larger problem simultaneously. Using the Cornell Theory Center’s parallel processing clusters, Bunge reduced processing time by more than a factor of four, thus shortening the time to discovery.
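The idea of inferring the unseen species from the rarely seen ones can be illustrated with a small sketch. This is not the method developed in this project; it uses the bias-corrected Chao1 estimator, a standard and much simpler nonparametric lower bound, chosen here only because it fits in a few lines. The function names and the sample counts are hypothetical.

```python
from collections import Counter


def frequency_counts(abundances):
    """Map k -> number of species observed exactly k times (k >= 1)."""
    return Counter(n for n in abundances if n > 0)


def chao1(abundances):
    """Bias-corrected Chao1 estimate of total species richness.

    Illustration only: Chao1 is a simple lower-bound estimator and is
    NOT the extrapolation method described in the text above.
    """
    freq = frequency_counts(abundances)
    s_obs = sum(freq.values())   # species seen at least once
    f1 = freq.get(1, 0)          # species seen exactly once
    f2 = freq.get(2, 0)          # species seen exactly twice
    # Many singletons relative to doubletons suggest many unseen species.
    unseen = f1 * (f1 - 1) / (2 * (f2 + 1))
    return s_obs + unseen


# Hypothetical sample: number of individuals observed per species.
sample = [10, 7, 3, 2, 2, 1, 1, 1, 1]
print(chao1(sample))  # 9 observed species plus an estimated 2 unseen -> 11.0
```

With four species seen once and two seen twice, the estimator adds two unseen species to the nine observed. A sample with no singletons adds none, which matches the reasoning that rarely seen species are the signal for missed ones.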

John Bunge holds a Ph.D. in mathematical statistics from The Ohio State University, in addition to degrees in French literature, music, and philosophy. He has been interested in the “species problem” since the mid-1980s, and in 1993 wrote a review of the subject that is now a basic reference. In the past two years, along with several graduate students, he has been collaborating with microbial ecologists on the problem of estimating the diversity of microbial populations, especially in the world’s oceans; this is a particularly challenging problem because 99 percent of microbial organisms cannot be cultured in the laboratory. This research relies heavily on high-performance computing, and as data flows into the project from an ever-increasing number of sources (46 sampling sites at last count), the mathematical, statistical and computational demands increase accordingly.

Moderator:
Johannes Gehrke, Ph.D.
Associate Professor, Dept. of Computer Science
Associate Director, Cornell Theory Center
Cornell University



Johannes Gehrke received his Ph.D. in computer science from the University of Wisconsin-Madison in 1999. Gehrke's research interests are in the areas of data mining, data stream processing, and data privacy. In his current research, Gehrke's group is building a scalable system for stateful publish-subscribe of XML data streams for enterprise-wide information management. He is also working on data privacy with a focus on privacy-preserving data mining and on techniques for publishing data while controlling the amount of information that is released.

His data mining research focuses on novel data mining algorithms, and his group has developed some of the fastest known algorithms for several important data mining tasks. He is also collaborating with several groups of scientists across campus to solve their data management and data mining problems. Gehrke has given courses and tutorials on data mining and data stream processing at international conferences and on Wall Street, and he has extensive industry experience as a technical advisor and consultant.




© 2004-2006 Reed Business Information, a division of Reed Elsevier Inc. All rights reserved.