Google Play Apps Data and Reviews Dataset

For the last couple of weeks I have been running a scraper to crawl and scrape app data and app reviews from Google Play. My initial aim was to build a list of all apps in Google Play. While I am still holding on to that goal, I have added a few more to the list. The result is a growing database of app data and reviews.

Dataset Information
The dataset is a work in progress and is divided into two parts: the app information subset and the app review subset.
The app information subset is still messy and needs some work before I can announce how many apps I have in the list so far. One thing I can say is that there are still more than 800k app pages on the crawl schedule (many of them are probably duplicates, a problem my crawler does not handle well).
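One way to cut down the duplicates on the crawl schedule is to key every scheduled page on its app ID rather than on the full URL. The sketch below is not the crawler's actual code. It assumes that app pages are addressed by an `id` query parameter and that two URLs with the same `id` point to the same app; the function names and the example URLs are made up for illustration.

```python
from urllib.parse import urlparse, parse_qs


def app_id_from_url(url):
    """Return the value of the `id` query parameter, or None if it is absent."""
    query = parse_qs(urlparse(url).query)
    ids = query.get("id")
    return ids[0] if ids else None


def dedupe_urls(urls):
    """Keep the first URL seen for each app ID, preserving the original order."""
    seen = set()
    unique = []
    for url in urls:
        app_id = app_id_from_url(url)
        # Skip URLs without an app ID and URLs whose app ID was already seen.
        if app_id is None or app_id in seen:
            continue
        seen.add(app_id)
        unique.append(url)
    return unique


# Hypothetical schedule: the first two URLs differ only in the `hl` parameter.
scheduled = [
    "https://play.google.com/store/apps/details?id=com.example.app",
    "https://play.google.com/store/apps/details?id=com.example.app&hl=en",
    "https://play.google.com/store/apps/details?id=com.example.other",
]
unique_urls = dedupe_urls(scheduled)
print(len(unique_urls))
```

With 800k pages on the schedule, a set of app IDs held in memory should still be manageable, but that is a guess and not something measured on this dataset.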
I have only just started working on the app review subset, and there are some puzzles to figure out. For example, I still haven’t figured out how to get all reviews for an app (Google limits how many it shows).
The app information subset is being built using Scrapy. The initial datasets were saved in CSV files, but I later moved to a PostgreSQL database. For the app review subset I am using that same database, but running a custom script to scrape the reviews.

Data Format
The app information subset contains the following fields.
app_id: App’s ID on Google Play
item_name: Display name of app
updated: Date of last update
author: Name of publisher
filesize: File size of app
downloads: Download numbers for app
version: Version number of app
compatibility: Android version compatibility of app
content_rating: Maturity rating of content
author_link: Publisher website link and/or email address
genre: Category under which app is published
price: Price of app
rating_value: User rating value
review_number: Number of user reviews
description: App description
iap: In-app purchase available or not
developer_badge: Google Developer badge (if any)
physical_address: Contact information of publisher
video_url: Video preview URL
developer_id: Publisher portfolio link in Google Play
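
For anyone who wants to load these fields into Python, the list above maps naturally onto a small record type. This is only a sketch: the field names come from the list, but the types are my guesses (the post does not say how each value is stored), and `AppRecord` and `APP_COLUMNS` are hypothetical names, not something from the crawler.

```python
from dataclasses import dataclass, asdict, fields
from typing import Optional


@dataclass
class AppRecord:
    # Field names follow the list above; the types are assumptions.
    app_id: str
    item_name: str
    updated: Optional[str] = None
    author: Optional[str] = None
    filesize: Optional[str] = None
    downloads: Optional[str] = None
    version: Optional[str] = None
    compatibility: Optional[str] = None
    content_rating: Optional[str] = None
    author_link: Optional[str] = None
    genre: Optional[str] = None
    price: Optional[str] = None
    rating_value: Optional[float] = None
    review_number: Optional[int] = None
    description: Optional[str] = None
    iap: Optional[bool] = None
    developer_badge: Optional[str] = None
    physical_address: Optional[str] = None
    video_url: Optional[str] = None
    developer_id: Optional[str] = None


# Column order for a CSV header or an INSERT statement.
APP_COLUMNS = [f.name for f in fields(AppRecord)]

# Made-up example values, not real data from the dataset.
record = AppRecord(
    app_id="com.example.app",
    item_name="Example App",
    rating_value=4.2,
    review_number=120,
)
row = asdict(record)
print(row["app_id"], row["rating_value"])
```

Everything except `app_id` and `item_name` is optional here, since not every app page necessarily has every field (a video preview or a developer badge, for example).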

The app review subset contains the following fields.
review_id: ID assigned to a review in the dataset
review_date: Date the review was submitted
review_text: Review by app user
rating: Rating given by reviewer
app_id: Google Play app ID
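
The `app_id` field is what ties a review back to its app. The sketch below shows that relationship with two tables and a join. It uses Python's built-in `sqlite3` with an in-memory database as a stand-in for PostgreSQL so that it runs without a server. The table names, the column types and the sample rows are all assumptions made for illustration, not the actual schema or data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Assumed schema: only two app columns are shown to keep the example short.
conn.executescript("""
CREATE TABLE apps (
    app_id    TEXT PRIMARY KEY,
    item_name TEXT
);
CREATE TABLE reviews (
    review_id   INTEGER PRIMARY KEY,
    review_date TEXT,
    review_text TEXT,
    rating      INTEGER,
    app_id      TEXT NOT NULL REFERENCES apps(app_id)
);
""")

# Made-up sample rows.
conn.execute(
    "INSERT INTO apps (app_id, item_name) VALUES (?, ?)",
    ("com.example.app", "Example App"),
)
conn.executemany(
    "INSERT INTO reviews (review_date, review_text, rating, app_id) "
    "VALUES (?, ?, ?, ?)",
    [
        ("2015-06-01", "Works well.", 5, "com.example.app"),
        ("2015-06-02", "Crashes on start.", 1, "com.example.app"),
    ],
)

# Number of reviews and average rating per app, joined on app_id.
rows = conn.execute(
    "SELECT a.item_name, COUNT(*), AVG(r.rating) "
    "FROM reviews r JOIN apps a ON a.app_id = r.app_id "
    "GROUP BY a.app_id, a.item_name"
).fetchall()
print(rows)
conn.close()
```

Leaving `review_id` out of the insert lets the database assign it, which fits the description of it as an ID assigned within the dataset.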

This dataset is still a work in progress and already runs to multiple gigabytes of data, so I have not shared it yet. If you want to get hold of the dataset now, please contact me at manoj@alo.ventures.

And for the enthusiasts, I have decided to share the scripts used in crawling Google Play (they are not yet available on my GitHub account) and a tiny glimpse of the dataset below.


3 Comments

  • uuvvTT June 4, 2015 at 10:51 am

    Thanks for sharing. Our team is trying to use your code.
    We’d like to scrape the comments on Google Play, but we failed to use the unofficial Google Market API,
    and the projects we found on GitHub also fail to crawl the reviews.
    I’ll try to use your code.

    • Manoj Pravakar Saha June 4, 2015 at 11:03 am

      You’re welcome. Please remember that my script will only scrape the first 40 reviews that Google Play shows for the crawler’s location. To scrape more reviews, you’ll have to extend the code to crawl the reviews that are loaded via Ajax. If you succeed, please share the code in the public domain. And good luck with your project!
