Climate change and the future of data: distributed architectures, parallel realities

Recently, a grassroots initiative led by scientists and researchers was launched to try to preserve climate change data that is believed to be under threat from President-elect Donald Trump, who will officially assume power at the beginning of 2017.
On this occasion, George Anadiotis, in his capacity as a ZDNet contributor, set out to explore the interplay between institutions and data, and the role advances in data technology can play. In the context of this article written for ZDNet, he reached out to spokespeople from EDGI (Environmental Data & Governance Initiative) and Protocol Labs. EDGI is part of the Climate Mirror movement, while Protocol Labs is a company developing and advocating a new distributed protocol for the web.
What follows is the unabridged version of interviews conducted over email with Matt Price (Director of Digital Development) and Nick Shapiro (Executive Director) from EDGI, and Matt Zumwalt, Program Manager at Protocol Labs.
Questions:
1. How likely do you think it is for climate data sets to actually disappear? At some point on Monday, December 19th, 2016, it appeared that the main entry point for NOAA's data was not accessible, returning an HTTP 404 Not Found error:
[Screenshot: NOAA's data portal returning an HTTP 404 Not Found error]
Do you think there is some sort of analogy between data and cultural resources? Would disabling access to data valuable for scientific research somehow equate to acts like destroying cultural heritage? If that is the case, would you say the Climate Mirror initiative is the digital equivalent of the 3D restoration of the Bamiyan statues?
2. One argument on the topic emphasizes that even if existing data is preserved through the collective efforts of people involved in this initiative, climate change research would still be hampered significantly on a global scale if data collection efforts are undermined from this point on.
One potential way of dealing with this is crowdsourcing climate data collection. Although there are known issues related to data quality for crowdsourced data, there are also potential solutions put forward, and this paradigm is already in use for commercial applications such as weather forecasting.
Even though crowdsourcing cannot at this point match the quality and precision of data collected with specialized equipment, do you think it can complement data from research and scientific agencies, and what would be needed both to involve more people in crowdsourcing and to make the best use of the collected data?
3. Would you say that this particular case points towards a need for change on the institutional level, and does that relate to technological capabilities?
For example, some of NOAA’s data was already hosted in the public cloud for free. And if successful, the Climate Mirror initiative will also host some of that data on servers at universities and research institutions. In both cases, however, that mitigates but does not eliminate the issue.
In both cases, there is a cost associated with hosting the data. Cloud providers like AWS have chosen to foot this bill as they expect that increased demand for computation on that data, also provided by them under commercial terms, will help them cover the associated cost.
Universities and research institutions don’t have a direct way to monetize that data, but they host it because it is required for their work and/or as a way to make a statement. So even though having more copies of the data is an improvement over a single point of access, if the cost of hosting that data, or any other reason, forces hosts to stop hosting it, access will be lost.
Another related issue is the authenticity of data. For data hosted by NOAA, for example, the assumption is that it is authentic because it is published by a credible agency. If, however, the same data is copied and hosted by a different party, or if the original publisher for some reason loses its credibility, it is not inconceivable that the data could be tampered with.
Advances in distributed and peer-to-peer technology and the commoditization of storage and compute may offer some solutions there. For example, the Internet Archive recently put out a call for a distributed web. Potential solutions also applicable in this case could include new protocols such as IPFS or blockchains, databases based on principles such as immutability and distribution, or personal cloud solutions like ownCloud and Nextcloud.
Do you see these playing a part going forward?
4. Another related issue is that even when scientific data such as climate change data is available, that does not necessarily make it easily accessible to all interested parties.
Physical access barriers aside, data is scattered across various repositories, encoded in various formats, typically needs manual processing and does not allow for querying, aggregation and cross-linking, and lacks metadata and context that would make it easier to understand and process.
This makes it hard even for domain experts to use that data and extract insights, let alone the general public. Consequently, arriving at disputable conclusions will be inevitable, and debunking them will require expert intervention.
Do you think data format standardization, metadata, documentation, and, ultimately, platforms for accessing environmental data that are built with a consumer-oriented philosophy could help there?
Answers – EDGI:
1. This is of course hard to judge, but we have good reason to believe that at least some, and likely a high percentage, of the data will disappear from public view. We believe taking precautionary measures is justified for the following reasons:

(1) Year over year, large amounts of information disappear from the web with astonishing regularity.  The natural turnover of information would account for some level of attrition.
(2) Data does not just keep itself alive and accessible; it can be starved out of public view by cutting off the resources necessary for its upkeep (this includes server costs, web (re)design, protection against bit rot, etc.).
(3) The Trump administration has signalled a hostility to climate change research especially, and to environmental science in general, and of course, to the use of evidence and reason even more universally.  The advisors in charge of climate-related programs and environmental enforcement (Pruitt, Perry, Sessions, Zinke, Ebell) are all on record voicing their skepticism of the scientific consensus around anthropogenic climate change.
(4) There is no reason to believe an administration hostile to climate science and environmental regulation will not avail itself of the tools developed by similarly minded administrations in other places, such as Canada, where a Conservative government shuttered libraries, destroyed archives, ended research programs, and muzzled scientists until its defeat at the polls only one year ago. The attempted closing of EPA libraries under George W. Bush is another historical precedent that informs our approach.
Given the above, and the very high stakes of the issue, we think it would be deeply imprudent to rely on the good will of the Trump administration in this matter.
We would perhaps go further, and suggest that scientific data is precisely a form of cultural heritage which belongs to all of us and to our descendants. While the scientific enterprise has never been perfect, and at various times has been deeply intermingled with the history of discrimination and injustice, it has also promoted a principle vital to our societies: that anyone can propose theoretical accounts to make sense of empirical data, and be judged only on the basis of their arguments, regardless of gender, race, social status, or other constraints.

Without evidence, those arguments can’t be made, and our common heritage of reasoned discourse disappears. So the duty to preserve data is, in our view, as strong as the duty of cultural preservation.
(Just a side note: there are many initiatives right now that are all coordinating; several EDGI members are in the Climate Mirror Slack, and many UPenn members are in ours.)
2.

It will take tremendous effort to replicate the coordinated data-gathering that governments and other large institutions have been able to provide. However, if it comes to that, we will need to build on the kinds of efforts that citizens have made since the 1970s, when they found themselves abandoned by governmental authorities. With contemporary information technologies, it should be possible to find solutions to the two major difficulties of previous efforts, namely standardization of observation protocols and amalgamation of results, which made it difficult for those early efforts to succeed. Projects like Galaxy Zoo and Folding@home have already shown that Internet-mediated crowdsourcing can be highly effective in some situations. We would need to develop a set of protocols, and we would need to harness the efforts of many, many people, but given the urgency of the issue, we might have a fighting chance of succeeding.
This is not just the case with climate data, but also with data relevant to environmental justice; our collaborators at Public Lab helped to write that report. While crowdsourcing may be possible (and is already a silent backbone of meteorology and ornithology), it should not fall to civil society to perform these tasks without government support.
3.
Yes :-). As this project has developed, we’ve come to realize that it essentially opens onto an enormous question about the future of the Internet and data in general. The sudden shift in expectations brought on by a single election in one (albeit very powerful) country highlights the fragility of our current data management strategies. If governments and large institutions can’t be relied on to preserve the public good in the form of public data (and in the long run, they probably can’t be), then we need to engage in a very serious rethinking of the foundations of the Internet. All of the options you discuss are very much at the forefront of our internal discussions.
4.
These technical considerations are essential but not sufficient. The cultural turn away from evidence needs to be countered by a critical analysis that acknowledges the imperfection of knowledge while still valuing the quest for truth, however elusive it might be.  So we need a fundamental reversal in the momentum of public discourse, away from a solipsistic scepticism and towards a collective, reinvigorated, search for inclusive common standards of evidence and verification. The tools you describe can provide immense assistance here, but in our view the underlying cultural shift is more fundamental, and will have to be carried out through hard, sustained, grassroots effort.
Answers – Protocol Labs:
1.

It’s not only the data that are precarious. The entire web is a precarious system because it’s structured as a centralized network. The solution is to upgrade to a decentralized web. This scramble to save climate data is just the latest symptom of the underlying disease of centralization. As long as we continue to rely on the current centralized approach to publishing and citing data, we will continue to face the problem of links breaking and data disappearing. The more valuable and central these data become, the more expensive and painful it will be to continue relying on a centralized network.

Destroying climate data is the equivalent of a book burning. Keeping copies of the data in many places is the equivalent of sending copies of a book to many places rather than keeping one copy of the book in a single, trusted location. When the Library of Alexandria burned, most of that information was lost forever. However, some of the information did survive because people had copies of those scrolls in other places.

2.

Crowdsourcing is about many people participating in the work of producing data. In order for crowdsourcing to truly work on a global scale we also need structures for everyone to participate in holding those data. We also need to allow people to create and exchange multiple “forks” of data, in the same way that git allows software developers to track multiple forks of the software they’re working on. This will not only make the data infrastructure resilient, it will force us to grapple directly and honestly with questions of attribution, authority, authenticity and trustworthiness. The existing centralized approach to data collection, data management and publication of research has allowed us to operate under the illusion that those are solved problems. Clearly they’re not.

3.

Any institutions that think of themselves as stewards of knowledge should be using decentralized approaches to publish and preserve data. These technologies make it possible for data to pass through many hands without losing integrity. That’s what we need. We need to leave the door open for anyone to become stewards of information at any time. Remember, we wouldn’t have copies of Plato’s works today if people had relied exclusively on the Library of Alexandria, or even the whole Roman empire, to hold those texts. How many people played a role in preserving that information through Europe’s dark ages? Where did those people live? Who paid them to hold onto it?

Location is a bad proxy for authenticity. Servers can be hacked and, as we are witnessing now, organizations can be forced to turn off servers (or forced to intentionally corrupt the server’s contents). If you want to publish data in a way that ensures its integrity and allows everyone to validate the authenticity of the data, you should be publishing the cryptographic hashes of your data and using a content-addressed protocol like IPFS to publish the data rather than using a location-addressed protocol like HTTP.

For further explanation of content-addressing and its benefits, see this article on saving endangered data: https://medium.com/@flyingzumwalt/instructions-for-saving-endangered-data-its-time-to-get-decentralized-23fb96aa8179#.m3iwvb8hp
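As a minimal sketch of what this looks like in practice, the short Python snippet below verifies a mirrored data set against a published cryptographic hash; the file name and digest are hypothetical placeholders. The key point is that authenticity is checked against the content itself, regardless of which server the copy came from, and content-addressed protocols like IPFS build that check into the address itself.

import hashlib
from pathlib import Path

def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    # Compute the SHA-256 hex digest of a file, reading it in chunks
    # so that large climate data sets do not need to fit in memory.
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_mirror_copy(path: Path, published_digest: str) -> bool:
    # A mirrored copy is considered authentic if its digest matches the
    # digest published by the original source, no matter which server,
    # university mirror or peer the bytes were downloaded from.
    return sha256_of_file(path) == published_digest

if __name__ == "__main__":
    # Hypothetical file name and digest, used only for illustration.
    dataset = Path("noaa_temperature_anomalies.csv")
    published = "replace-with-the-digest-published-by-the-data-provider"
    if dataset.exists():
        ok = verify_mirror_copy(dataset, published)
        print("authentic copy" if ok else "copy does not match published digest")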

This effort to set up climate data mirrors is an attempt to decentralize data. It needs decentralized technologies like IPFS and blockchains in order to succeed because those tools are designed to handle the technical obstacles that inevitably arise when you switch to a decentralized approach.

4. This particular set of issues is just the latest manifestation of the crisis of reproducibility that is currently gripping the world of peer-reviewed research. Content addressing, which lies at the core of decentralized technologies like git, BitTorrent and IPFS, is an absolutely essential tool for making research reproducible and making data analysis horizontally scalable on a global scale. With content addressing, any copy on the network can be used to service any request, so the many repositories hosting the data can serve all the users together. It no longer matters which repository hosts something; it only matters that it can be served to you. This will significantly help to unify fragmented services into a single combined effort. Specifically regarding data formats and cross-linking, we have developed IPLD (http://ipld.io/) as a generic format for traversing content-addressed data.
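As a rough sketch of the content-addressed linking that git, BitTorrent, IPFS and IPLD build on (a toy illustration, not the IPLD format itself), the Python snippet below stores objects under the hash of their own bytes and links records to each other by those hashes; the station identifier and values are invented for the example.

import hashlib
import json

class ContentStore:
    # A toy content-addressed store: every object is keyed by the hash of
    # its own bytes, so identical content always gets an identical address,
    # regardless of who holds the copy.

    def __init__(self):
        self._objects = {}

    def put(self, obj: dict) -> str:
        # Canonical JSON encoding, so the same logical object always
        # serializes (and therefore hashes) the same way.
        data = json.dumps(obj, sort_keys=True).encode("utf-8")
        address = hashlib.sha256(data).hexdigest()
        self._objects[address] = data
        return address

    def get(self, address: str) -> dict:
        data = self._objects[address]
        # Any copy can be re-verified against its own address before use.
        assert hashlib.sha256(data).hexdigest() == address
        return json.loads(data)

if __name__ == "__main__":
    store = ContentStore()
    # An invented measurement record, and a data set record linking to it by hash.
    reading = store.put({"station": "XYZ-001", "year": 2016, "t_anomaly_c": 0.87})
    dataset = store.put({"title": "example data set", "links": [reading]})
    print(store.get(dataset)["links"][0] == reading)  # True: links are content addresses

Because a link is the hash of the record it points to, any repository that holds the bytes can answer a request for that address, which is exactly what makes many independent mirrors interchangeable.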