11:00 am
Case Study Panel Discussion
Data-Driven Science and Cyberinfrastructure
Live Q&A session will
immediately follow
Scientific instruments that transform phenomena
in the physical world into digital data and
computer simulations of scientific experiments
have created a data-driven revolution in the
sciences. Scientists have moved from data-starved
environments to conditions where research communities
are drowning in data. An ongoing census at the
Arecibo Observatory of all pulsars in the Milky Way
will create about 1 petabyte of data over the
next four years. The Cornell Computational Agriculture
Initiative is developing high-resolution climate
data on a daily basis. Support for these projects
will require a comprehensive cyberinfrastructure.
Panelists will discuss how high performance
computing is essential to their research, with
specific examples from data-driven projects
such as the international pulsar project, or
PALFA Consortium, based on data collected from
the Arecibo Observatory, and the Cornell Computational
Agriculture Initiative, as well as how the Cornell
Theory Center is developing, implementing and
maintaining a cyberinfrastructure to support
their research and the plethora of large data
sets it generates.
Discussion Topics/Panelists:
The International Pulsar
Project
James Cordes, Ph.D.
Professor of Astronomy
Cornell University
The Arecibo Observatory, the world's largest
and most sensitive single-dish radio/radar telescope,
is managed by the National Astronomy and Ionosphere
Center (NAIC) at Cornell. Arecibo provides state-of-the-art
observing facilities for scientists in radio
astronomy, solar system radar astronomy and
atmospheric studies. The volume of information
being gathered in astronomy today is estimated
to be doubling every 1.5 years or so. This huge
growth in data volume is accompanied by a great
increase in data complexity. Cornell astronomers,
along with a consortium of national and international
researchers, use the Arecibo telescope to conduct
data-intensive surveys. These surveys will produce
thousands of terabytes of data. Arecibo data
and refined data products will be a unique resource
for years to come, providing synergistic opportunities
with other large-scale surveys that have been
done and with telescopes of the future, including
the Gamma-ray Large Area Space Telescope, to
be launched in 2007. Access to astronomical
data at the Cornell Theory Center will be provided
in accordance with virtual observatory methods
that are now being developed.
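As a rough illustration of the doubling estimate above, the following Python sketch computes how much a data volume grows if it doubles every 1.5 years. The function name and the sample time horizons are hypothetical, chosen only for illustration; the 1.5-year doubling time is the estimate quoted above, not a measured constant.

```python
def growth_factor(years, doubling_time_years=1.5):
    """Return the multiplicative growth after `years`, assuming the
    volume doubles once every `doubling_time_years` (an assumption
    taken from the estimate quoted in the text)."""
    return 2 ** (years / doubling_time_years)


# Hypothetical horizons, for illustration only.
for years in (1.5, 3, 6, 15):
    print(f"after {years} years: x{growth_factor(years):g}")
```

Under this assumption, a data volume grows fourfold in 3 years and roughly a thousandfold in 15 years.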
James Cordes's research
interests include radio astronomy, neutron stars,
pulsars, the interstellar medium, the search
for extraterrestrial intelligence, signal processing
techniques, statistical inference, and topics
in computer science. He regularly makes observations
using radio telescopes in Arecibo, Puerto Rico,
the Very Large Array in New Mexico, the Parkes
telescope in Australia, and the Very Long Baseline
Array, headquartered in New Mexico. Cordes also
makes infrared and optical observations using
the Hale Telescope at Palomar and has taken
part in joint radio and gamma-ray observations
using the Compton Gamma-ray Observatory and
X-ray Timing Explorer. He also uses the Hubble
Space Telescope and the Chandra X-ray Satellite
in his multiwavelength work.
He is currently planning observations using
the upgraded Arecibo Observatory and a new multiple-feed
receiver system that involve deep searches for
radio pulsars. Cordes is also heavily involved
in the Square Kilometer Array project, a next-generation
radio telescope.
Computational Agriculture
Harold Mathijs Van Es, Ph.D.
Professor, Crop and Soil Sciences
Cornell University
This program involves a collaborative effort
between the Cornell Theory Center (CTC) and
the College of Agriculture and Life Sciences
(CALS). CTC offers support and training in the
use of HPC technologies and encourages collaboration
as a group to move into a new era in agricultural
science research and quickly bring the results
of research to farmers and the general public.
With this, HPC necessarily goes beyond fast
data processing and includes effective methods
for data warehousing and querying, and the next-generation
user interface that allows for effective access
to HPC facilities from remote locations. Notably,
this initiative takes advantage of CTC's infrastructure
for Microsoft SQL applications, which allows
for rapid access to large databases and
integration with dynamic simulation modeling
efforts. Recently developed methods for
generating high-resolution climate data now require
terabytes of data storage, which needs to be
available for rapid access by dynamic models
and for continuous data mining.
The dynamic models, in turn, require HPC facilities
to process extensive multi-year simulations
for probabilistic assessments of agricultural-environmental
processes. In addition, several component projects
will take advantage of the integrated efforts
in GIS-based Web interfaces and data representation.
Harold van Es, Professor
of Soil and Water Management, joined Cornell
in 1988. He has extension, research and teaching
responsibilities related to the management of
soil and water resources. He serves as Director
of Graduate Studies and is the lead PI for the
Cornell Initiative on Computational Agriculture.
He is a Fellow of the Soil Science Society of
America and also served as chair of Division
S-6 (Soil and Water Management). He was an advisory
member to the New York State Soil and Water
Conservation Committee, and member of EPA Science
Review Boards for FQPA and FIFRA implementation.
He has authored over 60 refereed journal articles,
co-authored the best-selling book Building Soils
for Better Crops, and served as major advisor
for 18 graduate students.
Microbial Ecology
John Bunge, Ph.D.
Associate Professor, Department of Statistical
Science
Cornell University
Technology and parallel processing have the
potential to play significant roles in understanding
our environment. The Earth’s biodiversity
is immense: science is aware of several million
living species, but there are many more that
have not yet been identified. We are developing
innovative methods to help solve a tricky problem:
how many classes or species should we expect
to find in a population, given incomplete information?
Because it’s not practical to count every
organism, researchers sample the population
in a certain place and time. By chance, however,
members of some species can be missed altogether.
To compensate for this, statisticians look at
and plot the number of times each species was
seen and construct a graphical curve that can
be extrapolated back to estimate how many species
were observed zero times—that is, the
ones completely missed by sampling. In most
cases, we are working with populations that
contain hundreds or thousands of different species.
Some of them appear in the sample many times.
Others, though, appear very infrequently. When
researchers find that many species are seen
only once, or very infrequently, the logic follows
that there are other species the researchers
didn’t see at all. Parallel processing
uses many computer processors to work on
different parts of a problem simultaneously,
which is one way of increasing the speed at which
computers can work. Using the Cornell
Theory Center’s parallel processing clusters,
Bunge reduced processing time by more than a
factor of four, thus shortening the time to discovery.
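The text above does not say which estimator is used to extrapolate to the species observed zero times. As a minimal sketch of the general idea, the following Python example applies the bias-corrected Chao1 estimator, a standard nonparametric technique swapped in here purely for illustration; it is not presented as the method used in this research. The function name and the sample counts are hypothetical.

```python
from collections import Counter


def chao1_estimate(abundances):
    """Estimate total species richness from per-species sample counts
    using the bias-corrected Chao1 formula:

        S_est = S_obs + f1 * (f1 - 1) / (2 * (f2 + 1))

    where f1 is the number of species seen exactly once and f2 is the
    number of species seen exactly twice.
    """
    counts = [n for n in abundances if n > 0]  # ignore empty entries
    s_obs = len(counts)                        # species actually observed
    freq = Counter(counts)                     # frequency of frequencies
    f1, f2 = freq[1], freq[2]
    unseen = f1 * (f1 - 1) / (2 * (f2 + 1))    # estimated zero-count species
    return s_obs + unseen


# Hypothetical sample: number of times each observed species was seen.
sample = [10, 7, 4, 2, 2, 1, 1, 1, 1]
print(chao1_estimate(sample))
```

In this made-up sample, 9 species are observed, 4 of them once and 2 of them twice, so the estimate adds 2 unseen species for a total of 11. The more species that appear only once, the larger the estimated number of species missed entirely, which matches the reasoning described above.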
John Bunge holds a
Ph.D. in mathematical statistics from The Ohio
State University, in addition to degrees in
French literature, music, and philosophy. He
has been interested in the “species problem”
since the mid-1980s and wrote a review of the
subject in 1993, which is now a basic reference.
In the past two years, along with several graduate
students, he has been collaborating with microbial
ecologists on the problem of estimating the
diversity of microbial populations, especially
in the world’s oceans; this is a particularly
challenging problem because 99 percent of microbial
organisms cannot be cultured in the laboratory.
This research relies heavily on high-performance
computing, and as data flows into the project
from an ever-increasing number of sources (46
sampling sites at last count), the mathematical,
statistical and computational demands increase
accordingly.
Moderator:
Johannes Gehrke, Ph.D.
Associate Professor, Dept. of Computer Science
Associate Director, Cornell Theory Center
Cornell University
Johannes Gehrke received
his Ph.D. in computer science from the University
of Wisconsin-Madison in 1999. Gehrke's research
interests are in the areas of data mining, data
stream processing, and data privacy. In his
current research, Gehrke's group is building
a scalable system for stateful publish-subscribe
of XML data streams for enterprise-wide information
management. He is also working on data privacy
with a focus on privacy-preserving data mining
and on techniques for publishing data while
controlling the amount of information that is
released.
His data mining research focuses on novel data
mining algorithms, and his group has developed
some of the fastest known algorithms for several
important data mining tasks. He is also collaborating
with several groups of scientists across campus
to solve their data management and data mining
problems. Gehrke has given courses and tutorials
on data mining and data stream processing at
international conferences and on Wall Street,
and he has extensive industry experience as
a technical advisor and consultant.