US Press Freedom Tracker Data now available on the decentralized web via IPFS

A technical deep-dive into IPFS, IPNS, and keeping track of changes to the database

IPFS is an interesting protocol because its content identifiers (CIDs) or ‘hashes’ are cryptographically computed from the content of the file, not its name or other metadata.

This means that every time the file’s content changes, publishing it to IPFS produces a new CID.
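To make the idea concrete, here is a minimal sketch of content addressing using a plain SHA-256 digest. It is only an illustration: a real CID is not a bare hex digest (IPFS splits files into blocks and wraps the hash in a self-describing encoding), and the `content_digest` helper and sample bytes below are made up for the example.

```python
import hashlib


def content_digest(data: bytes) -> str:
    """Return a hex SHA-256 digest that depends only on the bytes given."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical sample data, not real tracker rows.
original = b"id,date\n1,2021-01-01\n"
renamed_copy = b"id,date\n1,2021-01-01\n"  # same bytes, imagine a different filename
edited = b"id,date\n1,2021-01-01\n2,2021-02-01\n"

# Same content gives the same digest, whatever the file is called.
print(content_digest(original) == content_digest(renamed_copy))  # True
# Any change to the content gives a different digest.
print(content_digest(original) == content_digest(edited))  # False
```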

There is nothing in the protocol that maintains any sort of ‘revision’ relationship between the old CID and the new one. It is up to the publisher to keep track of old versions of the file (if that’s important to them). Equally, it’s up to the publisher to tell people which CID is the new one, but it would be annoying to have to keep announcing new CIDs every time the file changes.

For this reason, the ID above is an ‘IPNS’ ID, which always points to the latest version of the folder and its contents, without itself ever having to change. IPNS is a little bit like DNS, in that it’s a sort of static ‘alias’ or pointer to another destination – in this case, the latest IPFS CID of the directory.
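Conceptually, an IPNS name behaves like a fixed label attached to a changeable target. The toy model below sketches just that relationship. It is not how IPNS is implemented (real records are signed with the publisher's key and distributed across the network); the `NamePointer` class and the example CID strings are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NamePointer:
    """Toy stand-in for an IPNS record: a stable name, a replaceable target."""

    name: str
    target_cid: Optional[str] = None
    sequence: int = 0  # counts how many times the pointer has been updated

    def publish(self, cid: str) -> None:
        """Point the name at a new CID; the name itself never changes."""
        self.target_cid = cid
        self.sequence += 1

    def resolve(self) -> Optional[str]:
        """Return the CID the name currently points to."""
        return self.target_cid


pointer = NamePointer("k51qzi5uqu5dlnwjrnyyd6sl2i729d8qjv1bchfqpmgfeu8jn1w1p4q9x9uqit")
pointer.publish("EXAMPLE-CID-1")  # placeholder values, not real CIDs
pointer.publish("EXAMPLE-CID-2")
print(pointer.name, "->", pointer.resolve())
```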

To maintain a sort of ‘revision’ log of changes to the incidents.csv database (and when it changed), we also publish a changelog file (incidents-log.csv) which shows the previous CIDs and a timestamp of when they were published. The last line in the file is always the latest version of the incidents.csv. You can also fetch the latest file directly (rather than view the directory) by using the IPNS hash, for example:

ipns://k51qzi5uqu5dlnwjrnyyd6sl2i729d8qjv1bchfqpmgfeu8jn1w1p4q9x9uqit/incidents.csv
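If your browser or tooling doesn't understand `ipns://` addresses, the same path can usually be requested through an HTTP gateway instead. The sketch below converts one form to the other. The `to_gateway_url` helper is hypothetical, and `https://ipfs.io` is just one public gateway chosen as an assumed default; any gateway you trust should work the same way.

```python
from urllib.parse import urlsplit


def to_gateway_url(uri: str, gateway: str = "https://ipfs.io") -> str:
    """Rewrite an ipfs:// or ipns:// address as a path-style gateway URL."""
    parts = urlsplit(uri)
    if parts.scheme not in ("ipfs", "ipns"):
        raise ValueError(f"not an IPFS or IPNS address: {uri!r}")
    # Gateways expose content under /ipfs/<cid>/... and /ipns/<name>/...
    return f"{gateway.rstrip('/')}/{parts.scheme}/{parts.netloc}{parts.path}"


address = "ipns://k51qzi5uqu5dlnwjrnyyd6sl2i729d8qjv1bchfqpmgfeu8jn1w1p4q9x9uqit/incidents.csv"
print(to_gateway_url(address))
```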

Feel free to look at older CIDs to see what changed between versions, or to consult the changelog file to find out when the latest version was published.
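Reading the changelog programmatically could look something like the sketch below. It assumes each row holds a timestamp followed by a CID, with no header row; that column layout is a guess made for illustration, so check the actual file before relying on it. The sample rows use placeholder values, not real CIDs.

```python
import csv
import io


def read_log(text: str) -> list:
    """Parse changelog text into a list of rows, skipping blank lines."""
    return [row for row in csv.reader(io.StringIO(text)) if row]


def latest_entry(text: str) -> list:
    """Return the last row, which describes the most recent version."""
    rows = read_log(text)
    if not rows:
        raise ValueError("changelog is empty")
    return rows[-1]


# Hypothetical sample in the assumed "timestamp,cid" layout.
sample_log = (
    "2021-03-01T10:00:00Z,EXAMPLE-CID-1\n"
    "2021-03-04T15:00:00Z,EXAMPLE-CID-2\n"
)

timestamp, cid = latest_entry(sample_log)
print(f"latest version {cid} published at {timestamp}")
```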

How often is the data published to IPFS?

We attempt to publish the latest copy of the database to IPFS every hour, but realistically the database itself changes far less frequently. The database is only published (and the changelog updated) if its content changes.
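The "only if it changed" rule can be sketched as a comparison between the newly computed CID and the last one recorded in the changelog. This is an illustration rather than our production code: the `record_if_changed` helper, the "timestamp,cid" column layout, and the placeholder CIDs are all assumptions.

```python
import csv
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def last_logged_cid(log_path: Path) -> Optional[str]:
    """Return the CID on the last row of the changelog, or None if there is none."""
    if not log_path.exists():
        return None
    with log_path.open(newline="") as fh:
        rows = [row for row in csv.reader(fh) if row]
    return rows[-1][1] if rows else None


def record_if_changed(log_path: Path, cid: str) -> bool:
    """Append a row only when the CID differs from the last one logged."""
    if cid == last_logged_cid(log_path):
        return False  # content unchanged, nothing to publish or log
    stamp = datetime.now(timezone.utc).isoformat()
    with log_path.open("a", newline="") as fh:
        csv.writer(fh).writerow([stamp, cid])
    return True


# Simulate three hourly runs, the second with unchanged content.
with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "incidents-log.csv"
    first = record_if_changed(log, "EXAMPLE-CID-1")
    repeat = record_if_changed(log, "EXAMPLE-CID-1")
    second = record_if_changed(log, "EXAMPLE-CID-2")
    line_count = len(log.read_text().splitlines())

print(first, repeat, second, line_count)  # True False True 2
```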

Care to share some code?

We initially tried to use what seems to be the official Python library for working with the IPFS API, but found that it doesn’t seem to support the most recent releases of go-ipfs, and is possibly semi-abandoned.

Fortunately, the go-ipfs service provides its own HTTP RPC API, so we could use Python’s requests module to talk to it.
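As a rough sketch of what talking to that API with `requests` can look like: the daemon's RPC endpoints live under `/api/v0/` and expect POST requests. The address below assumes a daemon listening on the default local port, and the `rpc_url` and `rpc_call` helpers are hypothetical names for this example. Note that `rpc_call` only suits commands that return a single JSON document.

```python
API_BASE = "http://127.0.0.1:5001/api/v0"  # assumed default local API address


def rpc_url(command: str) -> str:
    """Build the endpoint URL for a command such as 'version' or 'name/publish'."""
    return f"{API_BASE}/{command.strip('/')}"


def rpc_call(command: str, **params):
    """POST to the RPC API and return the decoded JSON reply."""
    # Third-party dependency, imported here so rpc_url works without it installed.
    import requests

    response = requests.post(rpc_url(command), params=params, timeout=30)
    response.raise_for_status()
    return response.json()


print(rpc_url("version"))
```

With a daemon running, `rpc_call("version")["Version"]` should return the version string of the node.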

Publishing a single file to the IPFS API is quite easy, and there are simple examples of how to do it. However, publishing a directory containing several files turned out to be a little trickier.

It took a bit of trial and error to work out how to send multiple files in a multipart request with the right tuple values per file, in a way that matched the IPFS API’s documentation, but we got there.
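To illustrate the general shape of such a request, here is a minimal sketch, not necessarily the code we ended up with. It assumes the `add` endpoint accepts one multipart part per entry, all under the field name `file`, with a directory part marked `application/x-directory` and URL-encoded relative paths as filenames, and that the reply is one JSON object per line. The helper names, the `tracker` directory name, and the local API address are all hypothetical.

```python
import json
from urllib.parse import quote


def build_file_parts(directory: str, files: dict) -> list:
    """Build the (field, (filename, content, content_type)) tuples for requests."""
    # First part announces the directory itself; it has no content.
    parts = [("file", (quote(directory, safe=""), b"", "application/x-directory"))]
    for name, content in files.items():
        # Each file's filename carries its path inside the directory, URL-encoded.
        encoded = quote(f"{directory}/{name}", safe="")
        parts.append(("file", (encoded, content, "application/octet-stream")))
    return parts


def parse_add_response(text: str) -> list:
    """Decode the newline-delimited JSON reply into a list of dicts."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def directory_cid(entries: list, directory: str) -> str:
    """Pick out the hash of the entry that names the directory itself."""
    for entry in entries:
        if entry.get("Name") == directory:
            return entry["Hash"]
    raise LookupError(f"no entry for directory {directory!r}")


def add_directory(directory: str, files: dict,
                  api_base: str = "http://127.0.0.1:5001/api/v0") -> str:
    """Send the directory to a local daemon and return its CID."""
    import requests  # third-party; only needed when actually publishing

    response = requests.post(
        f"{api_base}/add", files=build_file_parts(directory, files), timeout=60
    )
    response.raise_for_status()
    return directory_cid(parse_add_response(response.text), directory)


parts = build_file_parts("tracker", {"incidents.csv": b"a", "incidents-log.csv": b"b"})
print([filename for _, (filename, _, _) in parts])
```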

For those curious, here’s a sample of what worked for us. Happy hacking!

If you’re looking to install IPFS on a Linux server, we used an Ansible role for that, which worked great.
