Hacker News Links Analysis

by Szymon Lipiński

I made an analysis of the Hacker News data. My main interest was about the URLs, which appeared on the website.

Here is a short explanation of the process. Links for the analysis are at the end.

Downloading The Data

For downloading the data from the Algolia API, I used my nice data downloader https://github.com/szymonlipinski/hackernews_dowloader. There was a couple of issues to solve in this project, you can read more in the Readme file.

When the script stopped, I had about 9.7GB of CSV files with all the Hacker News content.

Loading The Data

For the analysis, I loaded all the data into a PostgreSQL database. There was just too much data to load into memory using Pandas or R’s data frame. My computer has “only” 32GB of RAM. Using swap, because there was not enough memory, made the loading even slower. And I haven’t even started processing the data, which usually makes the memory requirement seven bigger.

When loading the data, I had to assume that the entries are duplicated in the input files (due to some Algolia API limitation, check the downloader’s Readme file). The simplest way to load the files was to use the psql’s \copy command. After loading, I had to run select distinct to remove the duplicates. The speed was pretty nice.

I had CSV files with about 500MB each, loading them to PostgreSQL took about 6s for each.

Preprocessing The Data

My main idea about the analysis was to make some stats about the links. For preprocessing I wanted to make a nicely indexed table with all the URLs.

I chose materialized views, as they can be easily recalculated (if needed). Also, they don’t change the original data. They are also stored on disk, so I wouldn’t need to repeat the calculation again and again. They can also have indices, so the queries can be even faster.

I made two views:

The view with dates was needed as for each of the Hacker News entries I had only a integer with the number of seconds since the 1970-01-01.

The view with all the URLs was made in about 5 minutes, so it was still quite fast.

PostgreSQL Table Sizes

The table with content of all the csv files loaded to Postgres has about 9.5GB.

The materialized view with dates has 2GB, and the view with the urls about 1GB.

Of course, there were also indices.

Selecting And Grouping URLs

With these tables and views, getting the 10 most mentioned URLs was pretty simple, just like:

      SELECT url_domain, count(*)
      FROM urls
      GROUP BY url_domain
      ORDER BY count DESC
      LIMIT 10

The Analysis

The analysis was made in R with RMarkdown. I also tried using Jupyter Lab, but the generated reports were not as I liked. Especially conditional hiding chunks was not easy (I know, there is a plugin, but I couldn’t make it work).

The Reports

I made two reports:

What I Learned and Liked

RMarkdown is really nice. The things I liked most:

PostgreSQL is the best, this is not surprising at all:

What I Could Do Better

There is a couple of things, which could be improved. I’m not going to fix them now, but it’s always something to make better in the future.