The goal of the Data Science Incubator is to enable new science by bringing together data scientists and domain scientists to work on focused, intensive, collaborative projects. Projects frequently, but not exclusively, involve a non-trivial software engineering component. Our team of data scientists can provide expertise in state-of-the-art technology and methods in large-scale data manipulation and analytics (e.g., Hadoop, GraphLab, Myria, SciDB), cloud and cluster computing, statistics and machine learning, and visualization to help researchers extract knowledge from large, complex, and noisy datasets.
Overview
To apply to the program, any faculty, research staff, or student (typically, but not exclusively, at UW) can submit a short project proposal (details below) describing the science goals, the relevant datasets, and the expected technical challenges.
Each project must include a project lead who is willing to physically co-locate with the incubator staff. We find that collaboration in a shared space is important for deeper technical engagement and provides opportunities for "cross-pollination" among multiple concurrent projects. For Winter 2016, the incubator will operate on Tuesdays and Thursdays, and the project lead should plan to be available for several hours on these days. The program will operate out of the WRF Data Science Studio.
Incubator projects are not "for-hire" software jobs -- each project will be led by representatives of the applicant's team working in collaboration with the data scientists and the broader eScience community.
Areas of Focus
Each project will be different, but we emphasize projects in the following categories:
- Scalable Analytics:
- As data sizes continue to explode, parallel methods have become critical at every step. Scripts in Python and R are not natively parallel and are difficult to apply to datasets larger than main memory. Our team can help triage your problem and adapt it for use with parallel data manipulation and machine learning platforms such as Hadoop/MapReduce, parallel SQL databases, GraphLab, SciDB, and advanced research systems such as UW's own Myria. We also design and implement new parallel algorithms for large datasets independent of existing platforms.
- Data Management and Automation:
- Our collaborators report spending 90% of their time "handling" data as opposed to analyzing data: data discovery, acquisition, file format conversions, cleaning, restructuring, loading, sharing, etc. Leveraging technology from cloud providers and SQLShare, we aim to simplify or eliminate these data manipulation tasks and let researchers focus on the science.
- Visualization:
- We have experience building data-driven visualizations to help scientists make sense of data. We focus on web-enabled, interactive visualizations using platforms like D3.
- Reproducibility and Open Science:
- We can help you share your code, data, and results with collaborators and with the general public. We favor projects that emphasize open data and open source, allowing other researchers to recreate your results with minimal effort. We advocate alternative metrics and can help you maximize recognition and credit for ensuring reproducibility and open access. Suitable incubator projects may include organizing and uploading data into suitable public repositories, reviewing and publishing your code on GitHub, identifying venues for publishing papers describing your data or code (they exist!), or migrating your application to a commercial cloud to improve access.
We structure our work according to agile methodologies, typically breaking large projects into multiple short-term sprints of a few weeks each.
Success Stories
Our team has a strong track record of building systems that get real use. Some of our previous collaborations are listed below.
In the Spring 2014 pilot, we accepted 6 proposals from 5 different departments around campus led by students, postdocs, research staff, and faculty.
You can review the full list of projects from Spring 2014.
How to Get Started
Incubator proposals should contain the following information:
- Contact information for the project lead -- the one who will join us in the studio and be responsible for carrying out the project.
- Project summary / objective (1 page).
- A description of your data. Include at least the size, the formats, where the data currently resides, and any privacy and access restrictions. We strongly favor projects that have already collected the relevant data over "preparatory" projects that involve building software in anticipation of future data collection.
- A list of the key science questions the data will help answer, and a discussion of the publications that you anticipate resulting.
- A list of key technical challenges you face in answering these questions: Do you need new methods or algorithms? Do you need to scale up existing methods? Do you need to integrate data so it can be analyzed? Do you need to publish data and/or code to improve collaborative opportunities and reproducibility?
- The timeframe for your work.
- The names of those researchers who will be physically joining the team to lead the project.
These proposals are prioritized based on the following criteria:
- Good clustering between proposals; ideally, we seek a cohort of proposals with a common theme.
- Alignment with sponsor and program goals
- Participant availability and engagement
- Ability to answer fundamentally new research questions
- Clarity and shovel-readiness
- Capacity for measurable outcomes
- Capabilities of the incubator staff
We expect that some good proposals will not meet every criterion.
Important Dates for the Winter 2016 Session
- November 9th: Information meeting. Location: WRF Data Science Studio. Time: 11am.
- November 23rd: Applications due.
- December 4th: Notification of proposal selections.
- January 5th: Kickoff meeting. Location: WRF Data Science Studio.
Additional Info
You can learn more by reviewing the slides from the information session from February 2014.
In addition, you can review some frequently asked questions.