Data Scientist

A Collection of Public Datasets for Data Science Projects

Today we'll take a look at a collection of open data sources that you can use to try out data science projects.


Economic Data:

  1. Publicly Traded Market Data: Quandl is an excellent source of financial data. Google Finance and Yahoo Finance are good additional sources. Corporate filings with the SEC are available on EDGAR.
  2. Housing Price Data: You can use the Trulia API or the Zillow API.
  3. Lending Data: You can find student loan defaults by university and the complete collection of peer-to-peer loans from Lending Club and Prosper, the two largest platforms in the space.
  4. Home Mortgage Data: Data is made available under the Home Mortgage Disclosure Act, and the Federal Housing Finance Agency also publishes a lot of data.

Content Data:

  1. Review Content: You can get reviews of restaurants and physical venues from Foursquare and Yelp (see Geo Data below). Amazon has a large repository of product reviews. Beer reviews from Beer Advocate are also available. Rotten Tomatoes movie reviews are available from Kaggle.
  2. Web Content: Looking for web content? Wikipedia provides dumps of its articles. Common Crawl offers a large crawled corpus of the web. ArXiv makes all its data available via bulk download from AWS S3. Want to know which URLs are malicious? There’s a dataset for that. Music data is available from the Million Song Dataset. You can analyze the Q&A patterns on sites like Stack Exchange (including Stack Overflow).
  3. Media Data: There are open annotated articles from the New York Times, the Reuters dataset, and the GDELT project (a consolidation of many different news sources). Google Books has published n-grams for books going back to before 1800.
  4. Communications Data: Public messages of the Apache Software Foundation are accessible, as are communications among former Enron executives.

Government Data:

  1. Municipal Data: Crime data is available for the City of Chicago and Washington DC. Restaurant inspection data is available for Chicago and New York City.
  2. Transportation Data: NYC taxi trips from 2013 are available courtesy of the Freedom of Information Act. There’s bike-sharing data from NYC, Washington DC, and SF. There’s also flight delay data from the FAA.
  3. Census Data: Japanese census data. US Census data from 2010, 2000, and 1990. From census data, the government has also derived time use data. EU census data. Check out popular male and female baby names going back to the 19th century from the Social Security Administration.
  4. World Bank Data: The World Bank makes a lot of data available on its website.
  5. Election Data: Political contribution data for the last few US elections can be downloaded from the FEC. Polling data is available from Real Clear Politics.
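
As one way to start exploring the baby-name data mentioned above, here is a minimal sketch in Python. It assumes the downloaded files are plain text with one `name,sex,count` record per line and no header row, so check the files you actually get. The sample rows are made-up illustrative figures, and `top_names` is a hypothetical helper written for this example.

```python
import csv
import io
from collections import Counter

# Illustrative rows only (not real figures), in the assumed
# "name,sex,count" layout with no header row.
SAMPLE = """Mary,F,120
Anna,F,80
John,M,150
William,M,140
"""


def top_names(lines, sex, n=3):
    """Return the n most common names for one sex as (name, count) pairs."""
    counts = Counter()
    for row in csv.reader(lines):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        name, row_sex, count = row
        if row_sex == sex:
            counts[name] += int(count)
    return counts.most_common(n)


if __name__ == "__main__":
    print(top_names(io.StringIO(SAMPLE), "F"))
```

To run it on a real file, pass an open file object instead of the `io.StringIO` wrapper, for example `open(path, newline="")`.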

Data With a Cause:

  1. Environmental Data: Data on household energy usage is available, as well as NASA climate data.
  2. Medical and Biological Data: You can get anything from anonymized medical records, to remote sensor readings for individuals, to the genomes of 1,000 individuals.

Miscellaneous:

  1. Geo Data: Try looking at the Yelp datasets for venues near major universities and for major cities in the Southwest. The Foursquare API is another good source. OpenStreetMap has open data on venues as well.
  2. Twitter Data: You can get access to Twitter data used for sentiment analysis, network Twitter data, and social Twitter data, in addition to their API.
  3. Games Data: Datasets for games are available, including a large dataset of poker hands, a dataset of online Dominion games, and datasets of chess games.
  4. Web Usage Data: Web usage data is a common kind of dataset that companies look at to understand engagement. Available datasets include anonymous usage data for MSNBC, Amazon purchase history (also anonymized), and Wikipedia traffic.

Metasources: These are great starting points that lead to many other data sources.

  1. Stanford Network Data: http://snap.stanford.edu/index.html
  2. Every year, the ACM holds a competition for machine learning called the KDD Cup. Their data is available online.
  3. UCI maintains archives of data for machine learning.
  4. US Census Data
  5. Amazon hosts public datasets on S3.
  6. Kaggle hosts machine learning challenges, and many of their datasets are publicly available.
  7. The cities of Chicago, New York, Washington DC, and SF maintain public data warehouses.
  8. Yahoo maintains a lot of data on its web properties, which can be obtained by writing to them.
  9. BigML maintains a list of public datasets for the machine learning community on its blog.
  10. Finally, if there’s a website with data you are interested in, crawl for it!
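
For the last point, a crawl usually starts by collecting the links on a page that point to data files. The sketch below shows one possible way to do that using only the Python standard library. The names `DataLinkParser`, `find_data_links`, and `DATA_SUFFIXES`, the list of file suffixes, and the `example.org` URL are all assumptions made for illustration, not part of any particular site's API.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical list of file extensions that we treat as "data files".
DATA_SUFFIXES = (".csv", ".json", ".zip")


class DataLinkParser(HTMLParser):
    """Collect absolute URLs of <a href> links that look like data files."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.lower().endswith(DATA_SUFFIXES):
            # Resolve relative links against the page URL.
            self.links.append(urljoin(self.base_url, href))


def find_data_links(html, base_url):
    """Return data-file links found in an HTML string, in page order."""
    parser = DataLinkParser(base_url)
    parser.feed(html)
    return parser.links


if __name__ == "__main__":
    page = '<a href="/files/trips.csv">Trips</a> <a href="/about">About</a>'
    print(find_data_links(page, "https://example.org/data/"))
```

In a real crawl, the HTML would come from fetching the page (for example with `urllib.request.urlopen`). Before crawling, check the site's robots.txt and terms of use, and keep the request rate low.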

 


Chakkrit Tantithamthavorn

Data Science Technology Evangelist at datascience.in.th
I'm a data science lover. I believe that data science can bring huge benefits to your organization. I'm a solver of hard problems and an enthusiastic expert in statistical modelling. I work in R and Python all the time, and I'm hungry for data. Check out my full CV at http://chakkrit.com