The data was downloaded to a set of CSV files using the code available at https://github.com/szymonlipinski/hackernews_dowloader.
This produced the following files:
## 504M /home/data/hn/1554994838_1565866358__19635134_20704074.data.csv
## 485M /home/data/hn/1401111353_1421029195__994369_8872196.data.csv
## 467M /home/data/hn/1300452676_1326824346__2340190_3475825.data.csv
## 450M /home/data/hn/1209976952_1271054983__181299_1258580.data.csv
## 506M /home/data/hn/1543560409_1554994838__18567251_19635134.data.csv
## 505M /home/data/hn/1556804358_1567598222__19807762_20876231.data.csv
## 493M /home/data/hn/1454584884_1468706219__11033266_12108033.data.csv
## 505M /home/data/hn/1531176188_1543560409__17494018_18567251.data.csv
## 490M /home/data/hn/1421029195_1437923693__8872196_9951246.data.csv
## 499M /home/data/hn/1493867970_1506462121__14262172_15342765.data.csv
## 465M /home/data/hn/1271054983_1300452676__1258580_2340190.data.csv
## 478M /home/data/hn/1367793791_1384391674__5660073_6729887.data.csv
## 82M /home/data/hn/1554994838_1556804358__19635134_19807762.data.csv
## 505M /home/data/hn/1519132696_1531176188__16420052_17494018.data.csv
## 474M /home/data/hn/1326824346_1348853030__3475825_4586677.data.csv
## 498M /home/data/hn/1468706219_1481804010__12108033_13184043.data.csv
## 502M /home/data/hn/1481804010_1493867970__13184043_14262172.data.csv
## 502M /home/data/hn/1506462121_1519132696__15342765_16420053.data.csv
## 476M /home/data/hn/1348853030_1367793791__4586677_5660073.data.csv
## 498M /home/data/hn/1437923693_1454584884__9951246_11033266.data.csv
## 69M /home/data/hn/1160418111_1209976952__1_181299.data.csv
## 478M /home/data/hn/1384391674_1401111353__6729887_7799657.data.csv
## 9,7G /home/data/hn
The data is too large to keep in memory in R for processing on my machine. An alternative is to keep it in a database; I chose PostgreSQL.
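The connection parameters used later in this post assume a local database named hn owned by a user hn. A minimal setup sketch for that, run as a PostgreSQL superuser (the names and the password are arbitrary choices):

-- Assumed local setup matching the connection parameters used later:
CREATE ROLE hn LOGIN PASSWORD 'hn';
CREATE DATABASE hn OWNER hn;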
The table structure for the csv data is:
##
## CREATE TABLE raw_data (
## title TEXT,
## url TEXT,
## author TEXT,
## points INT,
## story_text TEXT,
## comment_text TEXT,
## num_comments INT,
## story_id INT,
## story_title TEXT,
## story_url TEXT,
## parent_id INT,
## created_at_i INT,
## type TEXT,
## object_id INT
## );
All the files were loaded with the following script:
## #!/bin/bash
##
##
## if (( $# != 4 )); then
## echo "Loads data from csv files to a postgres database"
## echo "USAGE:"
## echo "./load_files.sh DBNAME DBUSER TABLE_NAME FILES_DIRECTORY"
## exit 0
## fi
##
## DBNAME=$1
## DBUSER=$2
## TABLE_NAME=$3
## FILES_DIRECTORY=$4
##
## for f in $FILES_DIRECTORY/*.csv
## do
## echo "Loading $f"
## psql $DBNAME -U $DBUSER -c "\\COPY $TABLE_NAME FROM $f WITH CSV DELIMITER ',' HEADER "
## done
The loading time was about 6s per file.
According to the documentation of the downloader program:
Some entries in the files are duplicated, mostly because of Algolia API limitations. What's more, Hacker News users can edit their entries, so data downloaded later may differ from an earlier download. A mechanism loading the data into a processing pipeline should update the entries when it encounters a duplicated entry id.
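One way to satisfy that requirement would be an upsert-style load: copy each file into a staging table and merge it into a table that has object_id as its primary key, overwriting rows that already exist. This is only a sketch of that alternative, not what I did here; the names staging and raw_data_unique are made up:

-- Sketch only (not used in this post): merge a freshly loaded staging
-- table into a table with PRIMARY KEY (object_id), overwriting the
-- columns that users can change.
INSERT INTO raw_data_unique
SELECT DISTINCT ON (object_id) *
FROM staging
ON CONFLICT (object_id) DO UPDATE
  SET points       = EXCLUDED.points,
      num_comments = EXCLUDED.num_comments,
      comment_text = EXCLUDED.comment_text,
      story_text   = EXCLUDED.story_text,
      title        = EXCLUDED.title;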
To remove the duplicates after the load, I used a simple query which creates a new table without the duplicated rows. The primary key for the data is the object_id column, so to make things faster, I created an index and used DISTINCT ON:
## BEGIN;
##
## CREATE INDEX i_raw_data_object_id ON raw_data (object_id);
##
## CREATE TABLE data AS
## SELECT DISTINCT ON (object_id) *
## FROM raw_data;
##
## DROP TABLE raw_data;
##
## COMMIT;
I also need some indices on the data table for faster searching. I omitted the text columns, except for the ones where I will search on the whole value, like type = 'comment'.
## CREATE INDEX i_data_author ON data (author);
## CREATE INDEX i_data_points ON data (points);
## CREATE INDEX i_data_num_comments ON data (num_comments);
## CREATE INDEX i_data_story_id ON data (story_id);
## CREATE INDEX i_data_parent_id ON data (parent_id);
## CREATE INDEX i_data_created_at_i ON data (created_at_i);
## CREATE INDEX i_data_type ON data (type);
## CREATE INDEX i_data_object_id ON data (object_id);
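Whether a query actually benefits from these indices can be checked with EXPLAIN; a quick sanity check (the plan chosen depends on the data statistics, so this is only an illustration):

-- The planner should be able to use i_data_author here;
-- how it decides depends on how selective the author value is.
EXPLAIN SELECT count(*) FROM data WHERE author = 'pg';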
In the further data processing I will need to repeat some operations. To speed them up, I will calculate a couple of things up front and store them in the database. I like to use materialized views for this.
The only date field in the data table is created_at_i, an integer with the number of seconds since Jan 1st, 1970 (a Unix timestamp). As I will need to aggregate dates by week, day of week, month, and year, I will calculate these parts now to decrease query times later:
## create materialized view dates as
## select
## object_id,
## timestamp 'epoch' + created_at_i * interval '1 second' as date,
## date_part('year', timestamp 'epoch' + created_at_i * interval '1 second') as year,
## date_part('month', timestamp 'epoch' + created_at_i * interval '1 second') as month,
## date_part('week', timestamp 'epoch' + created_at_i * interval '1 second') as week,
## date_part('day', timestamp 'epoch' + created_at_i * interval '1 second') as day,
## date_part('dow', timestamp 'epoch' + created_at_i * interval '1 second') as dow,
## date_part('hour', timestamp 'epoch' + created_at_i * interval '1 second') as hour,
## date_part('minute', timestamp 'epoch' + created_at_i * interval '1 second') as minute,
## date_part('second', timestamp 'epoch' + created_at_i * interval '1 second') as second,
## to_char(timestamp 'epoch' + created_at_i * interval '1 second', 'yyyy-MM') as year_month
## from data;
For faster searching, I will add some indices on the above view:
## create index i_dates_object_id on dates(object_id);
## create index i_dates_year on dates(year);
## create index i_dates_month on dates(month);
## create index i_dates_date on dates(date);
I will also extract all the URLs from the specific fields. For now I will mark the source field of each URL, as it is possible that the distribution of URLs in story texts is different than in comments. The extraction mechanism is illustrated with a small example below, followed by the full view definition.
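The core of the extraction is regexp_matches with the g flag, which returns one row per match (each row being an array of capture groups), so a single text value can produce many URLs. A tiny standalone example with a made-up input string:

-- Made-up input, same URL regex as in the view below:
SELECT m[1] AS url
FROM regexp_matches(
    'see https://example.com/a and http://test.org/b?x=1 for details',
    '((?:http|https)://[a-zA-Z0-9][a-zA-Z0-9\.-]*\.[a-zA-Z]{2,}/?[^\s<"]*)',
    'gi'
) AS m;
-- returns two rows: https://example.com/a and http://test.org/b?x=1

The view below uses the equivalent unnest(regexp_matches(...)) form directly in the select list.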
## -- The urls can be everywhere
## -- If the entry type is a story, then it has fields like: title, url
## -- If it's a comment, then it has comment_text, story_title, story_url
## -- Jobs can have url, title, and story_text
## create materialized view
## urls as
## with url_data as
## (
## select
## distinct
## object_id, 'comment_text' as type,
## unnest(
## regexp_matches(comment_text, '((?:http|https)://[a-zA-Z0-9][a-zA-Z0-9\.-]*\.[a-zA-Z]{2,}/?[^\s<"]*)', 'gi')
## ) url
## from data
## UNION ALL
## select
## distinct
## object_id, 'story_title',
## unnest(
## regexp_matches(title, '((?:http|https)://[a-zA-Z0-9][a-zA-Z0-9\.-]*\.[a-zA-Z]{2,}/?[^\s<"]*)', 'gi')
## ) url
## from data
## UNION ALL
## select
## distinct
## object_id, 'story_text',
## unnest(
## regexp_matches(story_text, '((?:http|https)://[a-zA-Z0-9][a-zA-Z0-9\.-]*\.[a-zA-Z]{2,}/?[^\s<"]*)', 'gi')
## ) url
## from data
## UNION ALL
## select
## distinct
## object_id, 'url',
## unnest(
## regexp_matches(url, '((?:http|https)://[a-zA-Z0-9][a-zA-Z0-9\.-]*\.[a-zA-Z]{2,}/?[^\s<"]*)', 'gi')
## ) url
## from data
## ),
## clean_urls as (
## SELECT DISTINCT
## object_id,
## type,
## case when rtrim(url, './') ilike '%(%)%'
## then rtrim(url, './')
## else rtrim(url, './)')
## end as url
## FROM url_data
## WHERE url not like '%...'
## ),
## parts as (
## SELECT
## object_id, type, url,
##
## (regexp_matches(lower(url), '^(\w*)://[^/]*/?.*/?$')::TEXT[])[1] as protocol,
## (regexp_matches(lower(url), '^\w*://([^/]*)/?.*/?$')::TEXT[])[1] as domain,
## (regexp_matches(lower(url), '^\w*://(?:www.)?([a-zA-Z0-9_\.-]*).*$')::TEXT[])[1] as domain_without_www,
## (regexp_matches(url, '^\w*://[^/]*(/.*)/?$')::TEXT[])[1] as full_path,
## (regexp_matches(url, '^\w*://[^/]*/.*/?\?(.*)/?$')::TEXT[])[1] as params,
## (regexp_matches(url, '^\w*://[^/]*(/[^?#]*?)/?')::TEXT[])[1] as path
## FROM clean_urls
## )
## select
## *
## from parts;
For faster searching, I will add some indices on the above view:
## create index i_urls_object_id on urls(object_id);
## create index i_urls_protocol on urls(protocol);
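Since dates and urls are materialized views, they store a snapshot of the data; after loading newer dumps into the data table they have to be refreshed explicitly, for example:

-- Recompute the precalculated views after the underlying data changes:
REFRESH MATERIALIZED VIEW dates;
REFRESH MATERIALIZED VIEW urls;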
The sizes of the main tables, materialized views, and indices:
require("RPostgreSQL")
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname = "hn",
host = "localhost", port = 5432,
user = "hn", password = "hn")
tables <- dbGetQuery(con, '
SELECT
tablename "Table Name",
pg_size_pretty(pg_relation_size(tablename::text)) "Size"
FROM
pg_tables where schemaname=\'public\'
ORDER BY
tablename
')
views <- dbGetQuery(con, '
SELECT
matviewname "View Name",
pg_size_pretty(pg_relation_size(matviewname::text)) "Size"
FROM
pg_matviews where schemaname=\'public\'
ORDER BY
matviewname
')
indices <- dbGetQuery(con, '
SELECT
tablename "Table Name",
indexname "Index Name",
pg_size_pretty(pg_relation_size(indexname::text)) "Size"
FROM
pg_indexes
WHERE schemaname = \'public\'
ORDER BY tablename, indexname;
')
## Table Name Size
## 1 data 9764 MB
## 2 urls_old 858 MB
## View Name Size
## 1 dates 2156 MB
## 2 urls 1241 MB
## Table Name Index Name Size
## 1 data i_data_author 505 MB
## 2 data i_data_created_at_i 414 MB
## 3 data i_data_num_comments 414 MB
## 4 data i_data_object_id 414 MB
## 5 data i_data_parent_id 414 MB
## 6 data i_data_points 414 MB
## 7 data i_data_story_id 414 MB
## 8 data i_data_type 414 MB
## 9 dates i_dates_date 414 MB
## 10 dates i_dates_month 414 MB
## 11 dates i_dates_object_id 414 MB
## 12 dates i_dates_year 414 MB
## 13 urls i_urls_object_id 116 MB
## 14 urls i_urls_protocol 116 MB
The URLs are gathered regardless of the field type; I count one URL per object_id. So if the same URL appears in the same entry both in the title and in the description, it is counted once. However, if it appears in a story and in a comment to that story, it is counted as two separate URLs.
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname = "hn",
host = "localhost", port = 5432,
user = "hn", password = "hn")
urls <- dbGetQuery(con, "
WITH distinct_urls AS (
SELECT DISTINCT object_id, protocol, url
FROM urls
UNION
SELECT DISTINCT object_id, 'all',
domain || '/' || path as url
FROM urls
)
SELECT protocol, year_month date, count(*)
FROM distinct_urls
JOIN dates USING (object_id)
WHERE date < date_trunc('month', now())
GROUP BY protocol, year_month
ORDER BY year_month desc, protocol
")
dateRange <- dbGetQuery(con, "SELECT min(year) min, max(year) max FROM dates")
breaks <- mapply({function(year) sprintf("%s-01", year)},
seq(dateRange$min[1], dateRange$max[1]))
gg <- ggplot(data = urls, aes(x = date, y = count, group = protocol)) +
geom_point(aes(color = protocol)) +
geom_smooth(method = "loess", se = FALSE, aes(color = protocol)) +
labs(title = "Protocol Distribution For URLs By Month",
x = "Date",
y = "Count",
color = "Protocol Type") +
scale_x_discrete(breaks = breaks) +
theme(axis.text.x = element_text(angle = 30, hjust = 1))
plot(gg)
The number of https links started growing very fast in 2011. In the middle of 2013, the number of http links started decreasing. What is more interesting, the http links haven't disappeared yet.
entries <- dbGetQuery(con, "
WITH all_entries as (
SELECT year_month date, data.type, count(data.object_id)
FROM data
JOIN dates USING(object_id)
WHERE data.type in ('story', 'comment')
AND date < date_trunc('month', now())
GROUP BY year_month, data.type
),
entries_with_comments as (
SELECT year_month date, data.type, count(data.object_id)
FROM data
JOIN urls USING(object_id)
JOIN dates USING(object_id)
WHERE data.type in ('story', 'comment')
AND date < date_trunc('month', now())
GROUP BY year_month, data.type
)
SELECT e.date, e.type, e.count allCount, c.count commentsCount,
100.0 * c.count / e.count percent
FROM all_entries e
JOIN entries_with_comments c USING (date, type)
ORDER BY e.date, e.type
")
gg <- ggplot(data = entries, aes(x = date, y = allcount, group = type)) +
geom_point(aes(color = type)) +
geom_smooth(method = "loess", se = FALSE, aes(color = type)) +
labs(title = "Number of Entries By Month",
x = "Date",
y = "Count",
color = "Entry Type") +
scale_x_discrete(breaks = breaks) +
theme(axis.text.x = element_text(angle = 30, hjust = 1))
plot(gg)
The number of comments per month grows almost linearly. The number of stories also grows, but much more slowly.
gg <- ggplot(data = entries, aes(x = date, y = commentscount, group = type)) +
geom_point(aes(color = type)) +
geom_smooth(method = "loess", se = FALSE, aes(color = type)) +
labs(title = "Entries With URLs By Month",
x = "Date",
y = "Count",
color = "Entry Type") +
scale_x_discrete(breaks = breaks) +
theme(axis.text.x = element_text(angle = 30, hjust = 1))
plot(gg)
The numbers of stories and comments with links are also growing, which may simply be caused by the overall growth in the number of entries.
gg <- ggplot(data = entries, aes(x = date, y = percent, group = type)) +
geom_point(aes(color = type)) +
geom_smooth(method = "loess", se = FALSE, aes(color = type)) +
labs(title = "Percentage of Entries With URLs By Month",
x = "Date",
y = "Percent",
color = "Entry Type") +
scale_x_discrete(breaks = breaks) +
theme(axis.text.x = element_text(angle = 30, hjust = 1))
plot(gg)
Almost every story has a link, while fewer than 25% of the comments do. The percentages for stories and for comments stay almost constant over time, so it looks like the growth of the entries with links, shown in the previous charts, is caused by the overall growth of the number of entries.
mostActiveDomainsByMonth <- dbGetQuery(con, "
WITH domains AS (
SELECT domain_without_www, count(*)
FROM urls
GROUP BY domain_without_www
ORDER BY count DESC
LIMIT 10
)
SELECT domain_without_www as domain, year_month date, count(*)
FROM urls
JOIN dates USING (object_id)
WHERE date < date_trunc('month', now())
AND domain in (SELECT domain_without_www FROM domains)
GROUP BY domain_without_www, year_month
ORDER BY domain_without_www, year_month
;
")
domainsWithMostPointsByMonth <- dbGetQuery(con, "
WITH domains AS (
SELECT domain_without_www, sum(coalesce(points, 0)) points
FROM urls
JOIN data USING (object_id)
GROUP BY domain_without_www
ORDER BY points DESC
LIMIT 10
)
SELECT domain_without_www as domain, year_month date, sum(coalesce(points, 0)) points
FROM urls
JOIN data USING (object_id)
JOIN dates USING (object_id)
WHERE date < date_trunc('month', now())
AND domain in (SELECT domain_without_www FROM domains)
GROUP BY domain_without_www, year_month
ORDER BY domain_without_www, year_month
;
")
mostActiveDomains <- dbGetQuery(con, "
SELECT domain_without_www, count(*)
FROM urls
GROUP BY domain_without_www
ORDER BY count DESC
LIMIT 10
")
domainsWithMostPoints <- dbGetQuery(con, "
SELECT domain_without_www, sum(coalesce(points, 0)) points
FROM urls
JOIN data USING (object_id)
GROUP BY domain_without_www
ORDER BY points DESC
LIMIT 10
")
gg <- ggplot(data = mostActiveDomainsByMonth, aes(x = date, y = count, group = domain)) +
geom_line(aes(color = domain)) +
labs(title = "Most Mentioned Domains By Month",
x = "Date",
y = "Count",
color = "Domain") +
scale_x_discrete(breaks = breaks) +
theme(axis.text.x = element_text(angle = 30, hjust = 1))
plot(gg)
The domains with the largest number of posts are:
gg <- ggplot(data = domainsWithMostPointsByMonth, aes(x = date, y = points, group = domain)) +
geom_line(aes(color = domain)) +
labs(title = "Domains With Most Points By Month",
x = "Date",
y = "Points",
color = "Domain") +
scale_x_discrete(breaks = breaks) +
theme(axis.text.x = element_text(angle = 30, hjust = 1))
plot(gg)
The domains with the largest total number of points over the whole period are:
getDomainData <- function(con, domain) {
res <- dbGetQuery(con, "
SELECT path, sum(coalesce(points, 0)) point
FROM urls
JOIN data USING (object_id)
WHERE (path IS NOT NULL AND path <> '/')
AND domain_without_www = $1
GROUP BY path
ORDER BY point DESC
LIMIT 20;
", params = c(domain))
res$url <- paste(paste("https://", domain, sep=""), res$path, sep="")
res
}
xkcdMappings <- list(
"/standards.png",
"/heartbleed_explanation.png",
"/machine_learning.png",
"/security.png",
"/survivorship_bias.png",
"/click_and_drag.png",
"/password_strength.png",
"/ten_thousand.png",
"/instagram.png"
)
names(xkcdMappings) <- list(
"/927",
"/1354",
"/1838",
"/538",
"/1827",
"/1110",
"/936",
"/1053",
"/1150"
)
baseXkcdURL <- "https://imgs.xkcd.com/comics"
xkcd <- getDomainData(con, 'xkcd.com')
xkcd$embeddedUrl <- mapply({function(path) {
mappedUrl <- if (!(path %in% names(xkcdMappings))) {
path
} else {
xkcdMappings[path][1]
}
# This is a special case, as this comic is not stored
# at the usual embedded URL
if (path == "/radiation") {
return("https://imgs.xkcd.com/blag/radiation.png")
}
paste(baseXkcdURL, mappedUrl, sep = "")
}
}, xkcd$path)
xkcd$html <- sprintf(
'<span class="list-item"> <span class="image"> <a href="%s" target="_blank"><img src="%s"></a> <span class="link"> %s </span> </span> </span>',
xkcd$url,
xkcd$embeddedUrl,
xkcd$url
)
wikiData <- getDomainData(con, 'en.wikipedia.org')
# I need to remove some of the links, as they are not interesting for the analysis
# /w/index.php - it's the generic Wikipedia index script, not an article
wikiData <- wikiData %>%
filter(!(path %in% c("/w/index.php")))
wikiData <- wikiData[1:10, ]
# download the html pages from the wikipedia for the urls
wikiData$page <- mapply({function(url) url %>% GET() %>% content("text")}, wikiData$url)
# parse the html to get the article title
wikiData$title <- mapply({function(html) read_html(html) %>% html_nodes("#firstHeading") %>% html_text()}, wikiData$page)
# parse the html to get the article first paragraph
wikiData$firstParagraph <- mapply({function(html) {
x <- read_html(html) %>%
html_nodes("#mw-content-text .mw-parser-output p")
x <- Filter({function (t) html_text(t) != "\n"}, x)[1]
html_text(x)
}
}, wikiData$page)
data <- sprintf(
'<p class="list-item"> <span class="title"> <a href="%s" target="<!!>">%s</a> </span> <span class="first-paragraph"> %s </span> <span class="link"> <a href="%s" target="<!!>">%s</a> </span> </p>',
wikiData$url,
wikiData$title,
wikiData$firstParagraph,
wikiData$url,
wikiData$url
)
John McCarthy (computer scientist) John McCarthy (September 4, 1927 – October 24, 2011) was an American computer scientist and cognitive scientist. McCarthy was one of the founders of the discipline of artificial intelligence.[1] He coined the term “artificial intelligence” (AI),[2] developed the Lisp programming language family, significantly influenced the design of the ALGOL programming language, popularized timesharing, and was very influential in the early development of AI. https://en.wikipedia.org/wiki/John_McCarthy_(computer_scientist)
High-Tech Employee Antitrust Litigation High-Tech Employee Antitrust Litigation is a 2010 United States Department of Justice (DOJ) antitrust action and a 2013 civil class action against several Silicon Valley companies for alleged “no cold call” agreements which restrained the recruitment of high-tech employees. https://en.wikipedia.org/wiki/High-Tech_Employee_Antitrust_Litigation
Timeline of the far future While the future can never be predicted with absolute certainty,[1] present understanding in various scientific fields allows for the prediction of some far-future events, if only in the broadest outline. These fields include astrophysics, which has revealed how planets and stars form, interact, and die; particle physics, which has revealed how matter behaves at the smallest scales; evolutionary biology, which predicts how life will evolve over time; and plate tectonics, which shows how continents shift over millennia. https://en.wikipedia.org/wiki/Timeline_of_the_far_future
Wikipedia:Wikipedia Signpost/2017-02-27/Op-ed In biology, the hallmarks of an aggressive cancer include limitless and exponential multiplication of ordinarily beneficial cells, even when the body signals that further multiplication is no longer needed. The Wikipedia page on the wheat and chessboard problem explains that nothing can keep growing exponentially forever. In biology, the unwanted growth usually terminates with the death of the host. Exponential spending increases can often lead to the same undesirable result in organizations. https://en.wikipedia.org/wiki/Wikipedia:Wikipedia_Signpost/2017-02-27/Op-ed
Great Molasses Flood The Great Molasses Flood, also known as the Boston Molasses Disaster or the Great Boston Molasses Flood, and sometimes referred to locally as the Boston Molassacre,[1][2] occurred on January 15, 1919, in the North End neighborhood of Boston, Massachusetts. A large storage tank burst, filled with 2,300,000 US gal (8,700 m3; 8,706,447 liters)[3] (ca 12,000 tons; 10,886 metric tons; 24,000,000 lbs)[4] of molasses, and a wave of molasses rushed through the streets at an estimated 35 mph (56 km/h), killing 21 and injuring 150.[5] The event entered local folklore and residents claimed for decades afterwards that the area still smelled of molasses on hot summer days.[6][5] https://en.wikipedia.org/wiki/Great_Molasses_Flood
User talk:Jimbo Wales Congratulations, Jimbo Wales! Despite your request for donations by the end of June, you have refunded all donations from a small donor. I am glad that Wikipedia is so successful. 84.120.0.236 (talk) 14:11, 5 August 2019 (UTC) https://en.wikipedia.org/wiki/User_talk:Jimbo_Wales
Hy Hy (alternately, Hylang) is a programming language, a dialect of the language Lisp designed to interact with the language Python by translating expressions into Python’s abstract syntax tree (AST). Hy was introduced at Python Conference (PyCon) 2013 by Paul Tagliamonte.[1] https://en.wikipedia.org/wiki/Hy
Potato paradox The potato paradox is a mathematical calculation that has a counter-intuitive result. The so-called paradox involves dehydrating potatoes by a seemingly minuscule amount, and then calculating a change in mass which is larger than expected. https://en.wikipedia.org/wiki/Potato_paradox
Zero rupee note A zero-rupee note is a banknote imitation issued in India as a means of helping to fight systemic political corruption. The notes are “paid” in protest by angry citizens to government functionaries who solicit bribes in return for services which are supposed to be free. Zero rupee notes, which are made to resemble the regular 50 rupee banknote of India, are the creation of a non-governmental organization known as 5th Pillar which has, since their inception in 2007, distributed over 2.5 million notes as of August 2014. The notes remain in current use and thousands of notes are distributed every month. https://en.wikipedia.org/wiki/Zero_rupee_note
Room 641A Coordinates: 37°47′07″N 122°23′48″W / 37.78528°N 122.39667°W / 37.78528; -122.39667 https://en.wikipedia.org/wiki/Room_641A
# Get the links
arstechnicaData <- getDomainData(con, 'arstechnica.com')
arstechnicaData <- arstechnicaData[1:10, ]
# Download the pages
arstechnicaData$page <- mapply({function(url) url %>% GET() %>% content("text")}, arstechnicaData$url)
# parse the html to get the article title
arstechnicaData$title <- mapply({function(html) read_html(html) %>% html_nodes("h1") %>% html_text()}, arstechnicaData$page)
# parse the html to get the article first paragraph
arstechnicaData$firstParagraph <- mapply({function(html) {
x <- read_html(html) %>%
html_nodes(".article-content p")
html_text(x[1])
}
}, arstechnicaData$page)
data <- sprintf(
'<p class="list-item"> <span class="title"> <a href="%s" target="<!!>">%s</a> </span> <span class="first-paragraph"> %s </span> <span class="link"> <a href="%s" target="<!!>">%s</a> </span> </p>',
arstechnicaData$url,
arstechnicaData$title,
arstechnicaData$firstParagraph,
arstechnicaData$url,
arstechnicaData$url
)
SpaceX plans worldwide satellite Internet with low latency, gigabit speed SpaceX has detailed ambitious plans to bring fast Internet access to the entire world with a new satellite system that offers greater speeds and lower latency than existing satellite networks. https://arstechnica.com/information-technology/2016/11/spacex-plans-worldwide-satellite-internet-with-low-latency-gigabit-speed
Xamarin now free in Visual Studio SAN FRANCISCO—Microsoft bought Xamarin, the popular C#-and-.NET-on-iOS-and-Android, last month. At its Build developer conference today, the company announced the first big step for its new acquisition: Xamarin is now included in every Visual Studio version. https://arstechnica.com/information-technology/2016/03/xamarin-now-free-in-visual-studio
How “omnipotent” hackers tied to NSA hid for 14 years—and were found at last CANCUN, Mexico — In 2009, one or more prestigious researchers received a CD by mail that contained pictures and other materials from a recent scientific conference they attended in Houston. The scientists didn’t know it then, but the disc also delivered a malicious payload developed by a highly advanced hacking operation that had been active since at least 2001. The CD, it seems, was tampered with on its way through the mail. https://arstechnica.com/security/2015/02/how-omnipotent-hackers-tied-to-the-nsa-hid-for-14-years-and-were-found-at-last
Facebook scraped call, text message data for years from Android phones [Updated] [Update, March 25, 2018, 20:24 Eastern Time]: Facebook has responded to this and other reports regarding the collection of call and SMS data with a blog post that denies Facebook collected call data surreptitiously. The company also writes that it never sells the data and that users are in control of the data uploaded to Facebook. This “fact check” contradicts several details Ars found in analysis of Facebook data downloads and testimony from users who provided the data. More on the Facebook response is appended to the end of the original article below. https://arstechnica.com/information-technology/2018/03/facebook-scraped-call-text-message-data-for-years-from-android-phones
Microsoft and GitHub team up to take Git virtual file system to macOS, Linux One of the more surprising stories of the past year was Microsoft’s announcement that it was going to use the Git version control system for Windows development. Microsoft had to modify Git to handle the demands of Windows development but said that it wanted to get these modifications accepted upstream and integrated into the standard Git client. https://arstechnica.com/gadgets/2017/11/microsoft-and-github-team-up-to-take-git-virtual-file-system-to-macos-linux
Sorry, Comcast: Voters say “yes” to city-run broadband in Colorado Voters in Fort Collins, Colorado, yesterday approved a ballot question that authorizes the city to build a broadband network, rejecting a cable and telecom industry campaign against the initiative. https://arstechnica.com/tech-policy/2017/11/voters-reject-cable-lobby-misinformation-campaign-against-muni-broadband
Google’s constant product shutdowns are damaging its brand It’s only April, and 2019 has already been an absolutely brutal year for Google’s product portfolio. The Chromecast Audio was discontinued January 11. YouTube annotations were removed and deleted January 15. Google Fiber packed up and left a Fiber city on February 8. Android Things dropped IoT support on February 13. Google’s laptop and tablet division was reportedly slashed on March 12. Google Allo shut down on March 13. The “Spotlight Stories” VR studio closed its doors on March 14. The goo.gl URL shortener was cut off from new users on March 30. Gmail’s IFTTT support stopped working March 31. https://arstechnica.com/gadgets/2019/04/googles-constant-product-shutdowns-are-damaging-its-brand
Patent war goes nuclear: Microsoft, Apple-owned “Rockstar” sues Google Canada-based telecom Nortel went bankrupt in 2009 and sold its biggest asset—a portfolio of more than 6,000 patents covering 4G wireless innovations and a range of technologies—at an auction in 2011. https://arstechnica.com/tech-policy/2013/10/patent-war-goes-nuclear-microsoft-apple-owned-rockstar-sues-google
After 37 years, Voyager 1 has fired up its trajectory thrusters At present, the Voyager 1 spacecraft is 21 billion kilometers from Earth, or about 141 times the distance between the Earth and Sun. It has, in fact, moved beyond our Solar System into interstellar space. However, we can still communicate with Voyager across that distance. https://arstechnica.com/science/2017/12/after-37-years-voyager-has-fired-up-its-trajectory-thrusters
Mickey Mouse will be public domain soon—here’s what that means As the ball dropped over Times Square last night, all copyrighted works published in 1923 fell into the public domain (with a few exceptions). Everyone now has the right to republish them or adapt them for use in new works. https://arstechnica.com/tech-policy/2019/01/a-whole-years-worth-of-works-just-fell-into-the-public-domain
githubData <- getDomainData(con, 'github.com')
githubData <- githubData %>%
filter(!(path %in% c("/search", "/sponsors")))
githubData <- githubData[1:10, ]
data <- sprintf(
'<p class="list-item"> <span class="link"> <a href="%s" target="<!!>">%s</a> </span> </p>',
githubData$url,
githubData$url
)
https://github.com/minimaxir/big-list-of-naughty-strings
https://github.com/jmdugan/blocklists/blob/master/corporations/facebook/all
https://github.com/Eloston/ungoogled-chromium
https://github.com/dear-github/dear-github
https://github.com/docker/docker.github.io/issues/6910
https://github.com/blog/2164-introducing-unlimited-private-repositories
https://github.com/shadowsocks/shadowsocks-iOS/issues/124
Sometimes ad blockers prevent the Twitter cards from displaying; in that case you will see just the links.
twitterData <- getDomainData(con, 'twitter.com')
twitterData <- twitterData %>%
filter(!(path %in% c("/search")))
twitterData <- twitterData[1:10, ]
twitterData$html <- mapply({function(url) {
url <- paste("https://publish.twitter.com/oembed?url=", url ,sep = "")
url %>%
GET() %>%
content(as = "text") %>%
fromJSON() %>%
.$html
}
},
twitterData$url)
data <- sprintf(
'<div class="list-item"> %s <span class="link"> <a href="%s" target="<!!>">%s</a> </span> </div>',
twitterData$html,
twitterData$url,
twitterData$url
)
https://twitter.com/lemiorhan/status/935578694541770752
Dear @AppleSupport, we noticed a HUGE security issue at MacOS High Sierra. Anyone can login as “root” with empty password after clicking on login button several times. Are you aware of it @Apple?
— Lemi Orhan Ergin (@lemiorhan) November 28, 2017
https://twitter.com/ctavan/status/1044282084020441088
“Clear all Cookies except Google Cookies”, thanks Chrome. /cc @matthew_d_green pic.twitter.com/tR0UJjtPFL
— Christoph Tavan (@ctavan) September 24, 2018
https://twitter.com/FrancescoC/status/1119596234166218754
It is with great sadness that I share news of Joe Armstrong's passing away earlier today. Whilst he may no longer be with us, his work has laid the foundation which will be used by generations to come. RIP @joeerl, thank you for inspiring us all.
— Francesco Cesarini (@FrancescoC) April 20, 2019
https://twitter.com/Senficon/status/1014814460488413185
Great success: Your protests have worked! The European Parliament has sent the copyright law back to the drawing board. All MEPs will get to vote on #uploadfilters and the #linktax September 10–13. Now let's keep up the pressure to make sure we #SaveYourInternet! pic.twitter.com/VwqAgH0Xs5
— Julia Reda (@Senficon) July 5, 2018
https://twitter.com/sarahjeong/status/735924335412543488
🚨 GOOGLE'S USE OF THE DECLARING CODE AND SSO OF APIS IS FAIR USE 🚨
— sarah jeong (@sarahjeong) May 26, 2016
https://twitter.com/GambleLee/status/862307447276544000
A crashed advertisement reveals the code of the facial recognition system used by a pizza shop in Oslo… pic.twitter.com/4VJ64j0o1a
— Lee Gamble (@GambleLee) May 10, 2017
https://twitter.com/NASA/status/539814651404754944
We're sending humans to Mars! Watch our #JourneytoMars briefing live today at 12pm ET: http://t.co/6XtjOi1yJo #Orion pic.twitter.com/wrf89sn35A
— NASA (@NASA) December 2, 2014
https://twitter.com/hintjens/status/783254242052206592
I'm choosing euthanasia etd 1pm.
— Pieter Hintjens (@hintjens) October 4, 2016
I have no last words.
https://twitter.com/KodyKinzie/status/1146196570083192832
We made a video about launching fireworks over Wi-Fi for the 4th of July only to find out @YouTube gave us a strike because we teach about hacking, so we can't upload it.
— Kody (@KodyKinzie) July 2, 2019
YouTube now bans: “Instructional hacking and phishing: Showing users how to bypass secure computer systems”