Stats

Research Habits and Strategies

  • Ask LOTS of questions and regularly examine your assumptions
  • Build a foundational understanding of your topic/problem, familiarize yourself with the context(s) within which it is studied, and learn related terminology
  • When researching how to address a problem, consider both direct and indirect factors and solutions. For instance, increasing access to food assistance like food banks could reduce food insecurity, but what factors lead a family to need food assistance? Could addressing these factors have more impact?
  • Skim relevant literature reviews (found in library databases, PubMed Central, Google Scholar, etc.) to learn what’s known and debated about the topic among scholars
  • To find models, search for your topic/problem plus words like model, index, indicator(s), etc.
  • Consider the potential bias of the statistics-gathering or model-creating agency
  • Strategies for finding datasets
    • Some experts recommend defining your problem, variables or units of analysis (e.g., college students), time frame, and location of interest first; however, examining available data may also inform these decisions (note: some locations may be statistical, such as block groups)
    • Search/browse relevant data repositories (note: you may be able to search/limit by variable, raw data, etc.; data access may require registration)
    • Search sites of related government agencies, trade/industry groups, NGOs, research centers or institutes at universities, or other organizations likely to gather data on the topic to see if datasets are available (note: it’s rare for private companies to make data available for free)
    • Find relevant scholarly research in library databases, Google Scholar, open-access databases (e.g., PubMed Central), and elsewhere (note: in PMC, you can limit to articles with “Associated Data” as shown below). Check references for leads to datasets. For more tips about leveraging studies to find health/medical datasets, see this guide from Yale Libraries. You can even search a special database from ICPSR designed to help you “discover data via the literature.”

[Image: limiting PMC results to articles with “Associated Data”]

  • Strategies for finding statistics (or using stats as an indirect route to finding data)
    • Search for your topic along with words like statistics, data, report, analysis, findings, etc., or possibly surveillance, monitoring, or a unit of analysis related to your topic (e.g., accidents)
    • Few or no results? Try synonyms or related terms for a data point you’re seeking (e.g., fatalities, deaths, mortality rate) as well as broader terms for your topic (e.g., crops instead of corn)
    • Check the sites of government agencies and university research centers or institutes with a stake in your topic (e.g., bicycle share program) for reports, research, publications, or white papers – sometimes under subtopics like education or advocacy – and skim for factors that could become data points (e.g., showers for cyclists in DC). Then check the references for additional leads.
    • Repeat the previous step on sites of NGOs, trade/industry groups (e.g., American Public Transportation Association), advocacy groups, or special interest groups (e.g., International Bicycle Fund)
    • You may wish to limit results by desired file format (e.g., filetype:xls, filetype:pdf)
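
The search strategies above (topic plus data words, synonyms for a data point, and a file-format limit) can be combined into one query string. The sketch below is only an illustration: `build_query` is a hypothetical helper, and the `OR` grouping and `filetype:` operator work in Google but may behave differently in other search engines.

```python
def build_query(topic, data_words=(), synonyms=(), filetype=None):
    """Assemble a search query from a topic, data words, synonyms, and a file type."""
    parts = [topic]
    if synonyms:
        # Group alternative terms with OR; quote multi-word phrases
        group = " OR ".join(f'"{s}"' if " " in s else s for s in synonyms)
        parts.append(f"({group})")
    # Words like statistics, data, report, surveillance, monitoring
    parts.extend(data_words)
    if filetype:
        # Limit results to a file format, e.g. xls or pdf
        parts.append(f"filetype:{filetype}")
    return " ".join(parts)


query = build_query(
    "bicycle accidents",
    data_words=["statistics"],
    synonyms=["fatalities", "deaths", "mortality rate"],
    filetype="xls",
)
print(query)
```

If a query returns few or no results, try swapping in a broader topic (e.g., crops instead of corn) or dropping the file-format limit.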

Cite your data sources. Check this Quick Guide for citing data or the detailed How to Cite Data guide from librarian Hailey Mooney at Michigan State.



Data repositories covering many topics

African Information Highway from the African Development Bank Group – data on many topics
Data.gov (U.S. open data)
DataHub.io (from the folks who built the system that runs data.gov)
GSS Data Explorer – General Social Survey (U Chicago)
Github public data sets – data also organized by publisher (e.g., BuzzFeed)
Google Public Data – you might also try their dataset search
Inter-university Consortium for Political and Social Research (ICPSR) – or browse thematic collections; some data has restricted access and may only be available to participating institutions
Kaggle – data hub and data science community owned by Google
Scientific Data (VCU LibGuide)
Statista (portal for market and consumer data)
Statistic Brain (tables not exportable)
The World Bank Data Center
Zenodo – open data in science
Research Tip: If applicable, look for options to search by variable, mathematical method, etc.
Research Tip: If a site with historical data is no longer active, you can sometimes find an archival copy on the Internet Archive’s Wayback Machine or in government archives, such as this BSE Inquiry site archived by the UK’s National Archives. Since these site copies are not maintained, some links may no longer work. If a site is just temporarily down, you may be able to click the down-arrow next to its title in your Google search results to view the cached page.
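
If you need to check many inactive sites, the Wayback Machine lookup can be scripted through the Internet Archive's availability endpoint. The sketch below is a minimal illustration: the helper names are hypothetical, `example.com` is a placeholder, and the sample response is hand-written in the shape the endpoint documents rather than real output.

```python
import json
from urllib.parse import urlencode

WAYBACK_API = "https://archive.org/wayback/available"


def availability_url(site, timestamp=None):
    """Build the lookup URL for a site, optionally near a YYYYMMDD date."""
    params = {"url": site}
    if timestamp:
        params["timestamp"] = timestamp
    return f"{WAYBACK_API}?{urlencode(params)}"


def closest_snapshot(response_text):
    """Return the archived copy's URL from a JSON response, or None."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


# Hand-written sample response (an assumption for illustration only)
sample = json.dumps({
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20060101000000/http://example.com/",
            "timestamp": "20060101000000",
            "status": "200",
        }
    }
})

request_url = availability_url("example.com", "20060101")
snapshot = closest_snapshot(sample)
print(request_url)
print(snapshot)
```

To run this against the live service, fetch `request_url` (for example with `urllib.request.urlopen`) and pass the decoded response body to `closest_snapshot`.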

Agriculture

Research Tip: Investigating a food safety issue that a government agency may be tracking? Along with the disease or contaminant, search for words like surveillance or monitoring. Sample result: BSE Surveillance Information Center (USDA). Such sites may only highlight key points from a more detailed plan; if you find the underlying plan, the references may provide additional leads.

Crime


Demographics

American Community Survey (ACS) – annual survey; see questions and why each is asked; searchable by 1-yr, 3-yr, and 5-yr estimates – shorter period includes fewer geo areas
Census Data portal
Immigration data (U.S. Census Bureau)
Kids Count (uses U.S. Census data)
Migration flows state-by-state (U.S. Census Bureau)
Tax Stats (IRS)
World Population Estimates (U.S. Census Bureau)

Education and Health

Food Environment Atlas (USDA) – food insecurity, cost, etc.
GSS Data Explorer – General Social Survey (U Chicago)
Performance Monitoring for Action (PMA) – open data about 9 countries in Africa and Asia, including data about water, sanitation, and hygiene (WASH) indicators in Nigeria – registration required
Project TYCHO (UPitt) – public health data – req. registration
Zenodo – open data in science
Research Tip: A solid foundational understanding of your problem/topic, including recon on which data sources others tackling the problem are using, can inform your solutions and your data-finding strategies (e.g., the Data sources section on p.35 of this global report: A New Model for Water Access)

Energy and Environment

EarthData (NASA)
Intergovernmental Oceanographic Commission of UNESCO – global (CTRL+F to skim list for variable)
Oceanic Data (NOAA)
Performance Monitoring for Action (PMA) – open data about 9 countries in Africa and Asia, including data about water, sanitation, and hygiene (WASH) indicators in Nigeria – registration required

Labor and Trade

Exports.gov – incl. state data
International Trade for U.S. State and Metropolitan Areas
Quandl – financial data sets (set filter to free to avoid for-fee data)
Trade and Tariff data (USITC)
Labor Stats – U.S. (BLS) – incl. regional and state data
Research Tip: For industry-specific data, check the sites of trade/industry organizations for sections labeled research, publications, reports, or white papers, possibly under headings like education or advocacy. Sample paper: Myths and Statistics from OOIDA (Owner-Operator Independent Drivers Association), a trucking-related organization

Politics and Social Science

ASEP/JDS – intl social science
Global Indicators (PEW) – opinion of US and of China, confidence in American president
GSS Data Explorer – General Social Survey (U Chicago)
Human Rights Data Analysis Group – Available datasets
Public Opinion (VCU LibGuide)
U.S. Politics Data Sets (PEW) – req. registration for downloads
World Values Survey (req. registration for downloads)

Sports

Sports Statistics Guide (Drexel U Libraries)
Statistics in Sports (ASA) – some links are outdated but others OK
Yahoo! Sports (only male stats)

Data Visualization


Data Blogs and Podcasts

Andrew Gelman (Columbia)
Data Skeptic (podcast)
The Guardian’s (UK) Datablog
Women in Data Science (podcast – Stanford)

Code Sharing

Stack Overflow (for coding Q&A)

Resources for R and Git

Cookbook for R (Winston Chang)
Git Cheat Sheet (from GitHub)
Quick-R (Rob Kabacoff)

How to retrieve block group-level data from the ACS

The American Community Survey is one of the richest data sets compiled by the U.S. Census Bureau and is used by local governments, emergency services, and non-profit organizations to anticipate community needs. The most granular level of data is by Census Tract or Block Group. The tutorial below demonstrates how to use the Summary File Retrieval Tool (in conjunction with the Tech Document related to the Summary File you’re using) to retrieve this CSV-formatted data using Excel 2007 (or higher).
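
Once you have a CSV file, you may want to separate block group rows from tract rows outside of Excel. The sketch below assumes the file carries ACS-style GEOIDs, where the first three characters give the summary level (140 for census tract, 150 for block group) and the digits after "US" are the state, county, tract, and block group codes. The column names and population values are made up for illustration, and the retrieval tool's actual output may be laid out differently.

```python
import csv
import io

SUMMARY_LEVELS = {"140": "census tract", "150": "block group"}


def parse_geoid(geoid):
    """Split an ACS-style GEOID into its summary level and FIPS components."""
    prefix, _, fips = geoid.partition("US")
    level = prefix[:3]
    parts = {
        "summary_level": SUMMARY_LEVELS.get(level, level),
        "state": fips[:2],
        "county": fips[2:5],
        "tract": fips[5:11],
    }
    if level == "150":
        # Block group GEOIDs carry one extra digit after the tract code
        parts["block_group"] = fips[11:12]
    return parts


# Made-up sample rows; column names are hypothetical
sample_csv = """GEOID,total_population
14000US01001020100,1900
15000US010010201001,700
15000US010010201002,1200
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))
block_groups = [
    row for row in rows
    if parse_geoid(row["GEOID"])["summary_level"] == "block group"
]
for row in block_groups:
    print(row["GEOID"], row["total_population"])
```

For a real file, replace `io.StringIO(sample_csv)` with an opened file handle and adjust the column names to match the headers you see.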