Analicé todos los libros mencionados en Stack Overflow. Éstos son los más populares.

Encontrar su próximo libro de programación es difícil y arriesgado.

Como desarrollador, su tiempo es escaso y leer un libro consume mucho de ese tiempo. Podrías estar programando. Podrías estar descansando. Pero, en cambio, está dedicando un tiempo precioso a leer y ampliar sus habilidades.

Entonces, ¿qué libro deberías leer? Mis colegas y yo a menudo discutimos libros, y he notado que nuestras opiniones sobre un libro determinado varían enormemente.

Así que decidí profundizar en el problema. Mi idea: analizar el recurso para programadores más popular del mundo en busca de enlaces a una librería conocida y luego contar cuántas menciones tiene cada libro.

Afortunadamente, Stack Exchange (la empresa matriz de Stack Overflow) acababa de publicar su volcado de datos. Así que me senté y me puse a programar.

“Si tiene curiosidad, el libro más recomendado en general es Working Effectively with Legacy Code, con Design Pattern: Elements of Reusable Object-Oriented Software en segundo lugar. Si bien los títulos de estos son tan secos como el desierto de Atacama, el contenido debe ser de calidad. Puede ordenar los libros por etiquetas, como JavaScript, C, Graphics y cualquier otra cosa. Obviamente, este no es el fin de todas las recomendaciones de libros, pero sin duda es un buen lugar para comenzar si solo está comenzando a programar o si busca fortalecer sus conocimientos ". - reseña en Lifehacker.com

Poco después, lancé dev-books.com, que te permite explorar todos los datos que recopilé y clasifiqué. Recibí más de 100,000 visitantes y recibí muchos comentarios que me pedían que describiera todo el proceso técnico.

Entonces, como prometí, voy a describir cómo construí todo ahora mismo.

Obtener e importar los datos

Tomé el volcado de la base de datos de Stack Exchange de archive.org.

Desde el principio me di cuenta de que no sería posible importar un archivo XML de 48 GB en una base de datos recién creada (PostgreSQL) utilizando métodos populares como myxml := pg_read_file(‘path/to/my_file.xml’), porque no tenía 48 GB de RAM en mi servidor. Entonces, decidí usar un analizador SAX.

Todos los valores se almacenaron entre ow> tags, so I used a Python script to parse it:

After three days of importing (almost half of the XML was imported during this time), I realized that I’d made a mistake: the ParentID attribute should have been ParentId.

At this point, I didn’t want to wait for another week, and moved from an AMD E-350 (2 x 1.35GHz) to an Intel G2020 (2 x 2.90GHz). But this still didn’t speed up the process.

Next decision — batch insert:

StringIO lets you use a variable like file to handle the function copy_from, which uses COPY. This way, the whole import process only took one night.

OK, time to create indexes. In theory, GiST indexes are slower than GIN, but take less space. So I decided to use GiST. After one more day, I had an index that took 70GB.

When I tried couple of test queries, I realized that it takes way too much time to process them. The reason? Disk IO waits. SSD GOODRAM C40 120Gb helped a lot, even if it is not the fastest SSD so far.

I created a brand new PostgreSQL cluster:

initdb -D /media/ssd/postgresq/data

Then I made sure to change the path in my service config (I used Manjaro OS):

vim /usr/lib/systemd/system/postgresql.service
Environment=PGROOT=/media/ssd/postgresPIDFile=/media/ssd/postgres/data/postmaster.pid

I Reloaded my config and started postgreSQL:

systemctl daemon-reloadpostgresql systemctl start postgresql

This time it took couple hours to import, but I used GIN. The indexing took 20GB of space on SSD, and simple queries were taking less than a minute.

Extracting books from the database

With my data finally imported, I started to look for posts that mentioned books, then copied them over to a separate table using SQL:

CREATE TABLE books_posts AS SELECT * FROM posts WHERE body LIKE ‘%book%’”;

The next step was to find all the hyperlinks within those:

CREATE TABLE http_books AS SELECT * posts WHERE body LIKE ‘%http%’”;

At this point I realized that StackOverflow proxies all links like: rads.stackowerflow.com/[$isbn]/

I created another table with all posts with links:

CREATE TABLE rads_posts AS SELECT * FROM posts WHERE body LIKE ‘%//rads.stackowerflow.com%'";

Using regular expressions to extract all the ISBNs. I extracted Stack Overflow tags to another table through regexp_split_to_table.

Once I had the most popular tags extracted and counted, the top of 20 most mentioned books by tags were quite similar across all tags.

My next step: refining tags.

The idea was to take the top-20-mentioned books from each tag and exclude books which were already processed.

Since it was “one-time” job, I decided to use PostgreSQL arrays. I wrote a script to create a query like so:

With the data in hand, I headed for the web.

Building the web app

Since I’m not a web developer — and certainly not a web user interface expert — I decided to create a very simple single-page app based on a default Bootstrap theme.

I created a “search by tag” option, then extracted the most popular tags to make each search clickable.

I visualized the search results with a bar chart. I tried out Hightcharts and D3, but they were more for dashboards. These had some issues with responsiveness, and were quite complex to configure. So, I created my own responsive chart based on SVG. To make it responsive, it has to be redrawn on screen orientation change event:

Web server failure

Right after I published dev-books.com I had a huge crowd checking out my web site. Apache couldn’t serve for more than 500 visitors at the same time, so I quickly set up Nginx and switched to it on the way. I was really surprised when real-time visitors shot up to 800 at same time.

Conclusion:

I hope I explained everything clearly enough for you to understand how I built this. If you have any questions, feel free to ask. You can find me on twitter and Facebook.

As promised, I will publish my full report from Amazon.com and Google Analytics at the end of March. The results so far have been really surprising.

Make sure you click on green heart below and follow me for more stories about technology :)

Stay tuned at dev-books.com