For the last couple of weeks I have been running a scraper to crawl and scrape app data and app reviews from Google Play. My initial aim was to build a list of all apps on Google Play. While I am still holding on to that goal, I have added a few more to the list. The result is a growing database of app data and reviews.
The dataset is a work in progress and is divided into two parts: the app information subset and the app review subset.
The app information subset is still messy and needs some work before I can announce how many apps are on the list so far. One thing I can say: more than 800k app pages are still on the crawl schedule (many of them are probably duplicates, a problem my crawler does not handle well).
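One way to cut down on duplicate pages in the schedule is to key every URL on its app ID before queueing it. The snippet below is a minimal sketch of that idea, not the code my crawler runs. It assumes app pages follow the usual `https://play.google.com/store/apps/details?id=...` pattern, and the function names `extract_app_id` and `dedupe_urls` are made up for illustration.

```python
from urllib.parse import urlparse, parse_qs

# Assumed canonical form of an app page URL on Google Play.
APP_URL_PREFIX = "https://play.google.com/store/apps/details?id="


def extract_app_id(url):
    """Return the value of the `id` query parameter, or None if it is missing."""
    query = parse_qs(urlparse(url).query)
    ids = query.get("id")
    return ids[0] if ids else None


def dedupe_urls(urls):
    """Keep one canonical URL per app ID, preserving first-seen order."""
    seen = set()
    unique = []
    for url in urls:
        app_id = extract_app_id(url)
        if app_id is None or app_id in seen:
            continue  # not an app page, or already scheduled
        seen.add(app_id)
        unique.append(APP_URL_PREFIX + app_id)
    return unique


if __name__ == "__main__":
    schedule = [
        "https://play.google.com/store/apps/details?id=com.example.app&hl=en",
        "https://play.google.com/store/apps/details?hl=de&id=com.example.app",
        "https://play.google.com/store/apps/details?id=com.example.other",
    ]
    for url in dedupe_urls(schedule):
        print(url)
```

Extra query parameters such as the language code are dropped, so two links to the same app collapse into one entry.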
The app review subset is something I have just started working on, and there are some puzzles to figure out. For example, I still haven’t figured out how to get all reviews for an app (Google limits how many it returns).
The app information subset is being built using Scrapy. The initial datasets were saved in CSV files, but I later moved to a PostgreSQL database. For the app review subset I am using that same database, but running a custom script to scrape the reviews.
The app information subset contains the following fields.
app_id: App’s ID on Google Play
item_name: Display name of app
updated: Date of last update
author: Name of publisher
filesize: File size of app
downloads: Download numbers for app
version: Version number of app
compatibility: Android version compatibility of app
content_rating: Maturity rating of content
author_link: Publisher website link and/or email address
genre: Category under which app is published
price: Price of app
rating_value: User rating value
review_number: Number of user reviews
description: App description
iap: In-app purchase available or not
developer_badge: Google Developer badge (if any)
physical_address: Contact information of publisher
video_url: Video preview URL
developer_id: Publisher portfolio link in Google Play
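For readers who want to work with these records in code, the field list above could be mirrored as a plain Python dataclass. This is only a sketch: the field names come from the list, but the types and defaults are my assumptions for illustration and do not describe the actual Scrapy item or database columns.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class AppInfo:
    # Field names follow the list above; every type here is an assumption.
    app_id: str
    item_name: str
    updated: Optional[str] = None           # date of last update, kept as text
    author: Optional[str] = None
    filesize: Optional[str] = None          # raw text as scraped
    downloads: Optional[str] = None         # raw text as scraped
    version: Optional[str] = None
    compatibility: Optional[str] = None
    content_rating: Optional[str] = None
    author_link: Optional[str] = None
    genre: Optional[str] = None
    price: Optional[str] = None
    rating_value: Optional[float] = None
    review_number: Optional[int] = None
    description: Optional[str] = None
    iap: Optional[bool] = None              # in-app purchases available or not
    developer_badge: Optional[str] = None
    physical_address: Optional[str] = None
    video_url: Optional[str] = None
    developer_id: Optional[str] = None


if __name__ == "__main__":
    app = AppInfo(app_id="com.example.app", item_name="Example App")
    print([f.name for f in fields(app)])
```

Only `app_id` and `item_name` are required in this sketch, since many of the other fields can be missing on a given app page.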
The app review subset contains the following fields.
review_id: ID assigned to a review in the dataset
review_date: Date the review was submitted
review_text: Review by app user
rating: Rating given by reviewer
app_id: Google Play app ID
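To give a feel for how these review fields could sit in a table, here is a small sketch that uses Python's built-in `sqlite3` as a stand-in for PostgreSQL. The table name `reviews`, the column types, the 1 to 5 rating check, and the helper names are all assumptions for illustration, not the schema I actually use.

```python
import sqlite3

# Hypothetical schema mirroring the review fields listed above.
SCHEMA = """
CREATE TABLE IF NOT EXISTS reviews (
    review_id   INTEGER PRIMARY KEY,   -- ID assigned within the dataset
    review_date TEXT,
    review_text TEXT,
    rating      INTEGER CHECK (rating BETWEEN 1 AND 5),
    app_id      TEXT NOT NULL
)
"""


def open_review_store(path=":memory:"):
    """Open (or create) the review store and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def add_review(conn, app_id, review_date, review_text, rating):
    """Insert one review and return the review_id assigned to it."""
    cur = conn.execute(
        "INSERT INTO reviews (review_date, review_text, rating, app_id) "
        "VALUES (?, ?, ?, ?)",
        (review_date, review_text, rating, app_id),
    )
    conn.commit()
    return cur.lastrowid


if __name__ == "__main__":
    conn = open_review_store()
    rid = add_review(conn, "com.example.app", "2015-01-01", "Works well.", 5)
    print(rid, conn.execute("SELECT COUNT(*) FROM reviews").fetchone()[0])
```

The `review_id` is generated by the database on insert, which matches the idea of an ID assigned inside the dataset rather than one taken from Google Play.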
This dataset is still a work in progress and already runs to multiple gigabytes, so I have not shared it yet. If you want to get hold of the dataset now, please contact me at firstname.lastname@example.org.
And for the enthusiasts, I have decided to share the scripts used to crawl Google Play (they are not yet available on my GitHub account) and a tiny glimpse of the dataset below.