Hacker News Links Analysis

published at 15 Sep, 2019 by Szymon Lipiński tags: programming python R

I analyzed the Hacker News data. My main interest was the URLs that appeared on the website.

Here is a short explanation of the process. Links for the analysis are at the end.

Downloading The Data

For downloading the data from the Algolia API, I used my nice data downloader https://github.com/szymonlipinski/hackernews_dowloader. There were a couple of issues to solve in this project; you can read more about them in the Readme file.

When the script stopped, I had about 9.7GB of CSV files with all the Hacker News content.

Loading The Data

For the analysis, I loaded all the data into a PostgreSQL database. There was just too much data to load into memory using Pandas or R’s data frames. My computer has “only” 32GB of RAM, and using swap when the memory ran out made the loading even slower. And I hadn’t even started processing the data, which usually makes the memory requirement even bigger.

When loading the data, I had to assume that the entries were duplicated in the input files (due to an Algolia API limitation; check the downloader’s Readme file). The simplest way to load the files was to use psql’s \copy command. After loading, I ran select distinct to remove the duplicates. The speed was pretty nice.

The CSV files were about 500MB each, and loading each one into PostgreSQL took about 6 seconds.
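A minimal sketch of the loading and deduplication steps could look like this; the table name and column list here are assumptions, not the actual schema:

```sql
-- A sketch only: the table name and columns are assumptions.
CREATE TABLE items_raw (
    id           BIGINT,
    type         TEXT,
    author       TEXT,
    title        TEXT,
    url          TEXT,
    comment_text TEXT,
    points       INTEGER,
    created_at_i BIGINT   -- seconds since 1970-01-01
);

-- In psql, load each CSV file:
-- \copy items_raw FROM 'part_001.csv' WITH (FORMAT csv, HEADER true)

-- Remove the duplicates coming from the overlapping API responses:
CREATE TABLE items AS
SELECT DISTINCT * FROM items_raw;
```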

Preprocessing The Data

My main idea for the analysis was to compute some stats about the links. For preprocessing, I wanted to build a nicely indexed table with all the URLs.

I chose materialized views, as they can easily be recalculated if needed and they don’t change the original data. They are also stored on disk, so I wouldn’t need to repeat the calculation again and again. They can also have indices, so the queries can be even faster.

I made two views:

  • with dates like year, month, day of week, etc.
  • with URLs extracted from the entries’ fields like title or comment

The view with dates was needed because for each Hacker News entry I had only an integer with the number of seconds since 1970-01-01.
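Such a dates view could be sketched with to_timestamp and extract; the table and column names here are assumptions:

```sql
-- A sketch, assuming the items table has a created_at_i column
-- with the number of seconds since 1970-01-01.
CREATE MATERIALIZED VIEW dates AS
SELECT
    id,
    to_timestamp(created_at_i)                     AS created_at,
    extract(year FROM to_timestamp(created_at_i))  AS year,
    extract(month FROM to_timestamp(created_at_i)) AS month,
    extract(dow FROM to_timestamp(created_at_i))   AS day_of_week
FROM items;

CREATE INDEX dates_year_idx ON dates (year);
```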

The view with all the URLs was built in about 5 minutes, so it was still quite fast.
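The URL view could be sketched with regexp_matches; the deliberately simple regular expression and all the names below are assumptions (the real pattern would need better tests, as mentioned at the end of this post):

```sql
-- A sketch only: names and the regular expression are assumptions.
CREATE MATERIALIZED VIEW urls AS
SELECT
    id,
    (regexp_matches(
        coalesce(title, '') || ' ' || coalesce(comment_text, ''),
        'https?://([^/\s"]+)',   -- capture just the domain part
        'g'))[1] AS url_domain
FROM items;

CREATE INDEX urls_domain_idx ON urls (url_domain);
```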

PostgreSQL Table Sizes

The table with the content of all the CSV files loaded into Postgres takes about 9.5GB.

The materialized view with dates takes 2GB, and the view with the URLs about 1GB.

Of course, there were also indices.
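Sizes like these (including the indices) can be checked with pg_total_relation_size; the table and view names below are assumptions:

```sql
SELECT relname,
       pg_size_pretty(pg_total_relation_size(oid)) AS total_size
FROM pg_class
WHERE relname IN ('items', 'dates', 'urls');
```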

Selecting And Grouping URLs

With these tables and views, getting the 10 most mentioned URLs was pretty simple:

      SELECT url_domain, count(*)
      FROM urls
      GROUP BY url_domain
      ORDER BY count DESC
      LIMIT 10
      

The Analysis

The analysis was made in R with RMarkdown. I also tried using Jupyter Lab, but the generated reports were not to my liking. Especially conditionally hiding chunks was not easy (I know there is a plugin, but I couldn’t make it work).

The Reports

I made two reports:

  • The simple version, with only the results.
  • The full version, with all the code used for the analysis.

What I Learned and Liked

RMarkdown is really nice. The things I liked most:

  • the chunks can be cached on disk
  • the chunks can be optionally rendered
  • other rmarkdown documents can be included in a document (also optionally)
  • RStudio has great support for getting help, installing libraries, and viewing data frame contents
  • the html output can be nicely styled with a custom css

PostgreSQL is the best; this is not surprising at all:

  • the CSV import speed is amazing
  • the materialized views are a very good way to store preprocessed data for further querying
  • doing most of the processing in the database was a huge win; I had no out-of-memory errors like I had earlier with Python’s Pandas library, where all the data had to be loaded into memory

What I Could Do Better

There are a couple of things that could be improved. I’m not going to fix them now, but there is always something to make better in the future.

  • the dates view has too many unused fields
  • I should have better tests for the regular expressions used in the SQL queries for extracting the URLs
  • the R code should be refactored into multiple functions for code reuse, as there was some code duplication
  • the GitHub links are only a simple list; there are no official GitHub cards
  • the Twitter results can be ugly, as adblockers disable the Twitter cards