Working with data collected by public bodies is crucial to public policy research. Open data, and data that is easily accessible and reusable, is a fast-growing and important part of this economy. Most public data, however, is published as PDFs, which is not a convenient format for anyone who wants to use the data for further analysis.
As a step in this direction, we began with the idea of ‘liberating’ these PDFs and ‘freeing’ the data locked inside them. To do so, CITAPP, along with and organised a day-long hackathon. The hackathon opened with an introduction to why open data is important, followed by a session on extracting data from PDFs.
The event was attended by 42 students from IIITB, from the MTech, MS/PhD and iMTech programs. The participants divided themselves into ten teams, and each team chose a PDF document to work with. The idea was to have participants convert the tables of data in these PDFs into more accessible formats, including CSV files and spreadsheets. The PDFs that the participants worked on can be found .
The day started at 10.30 AM with an introduction by Nisha from Datameet, followed by a presentation of the PDFs and a presentation on the tools the participants could use. The hackathon itself began right after lunch and ran until 5.30 PM, with seven of the ten teams submitting ‘freed’ data from their PDFs.
The extracted data has been uploaded to a public Google Drive folder, and can be found .