Research project for the Dutch Digital Heritage Network (NDE) focused on predicting obsolete file formats
To run the analyses in this repository, you need a recent Python installation and a dependency manager. I chose Pipenv because it records the dependencies, their versions, and the Python version I used to create the scripts.
You can install Pipenv with:
pip install pipenv
After that, clone this repository and install its dependencies with:
git clone https://github.com/Antfield-Creations/NDE-monitoring-file-formats
cd NDE-monitoring-file-formats
pipenv install
This will create a virtual environment (a "virtualenv") with the installed dependencies. After installation, you can run
pipenv shell
to activate the virtual environment. The following analyses are available for your perusal:
- The Common Crawl analysis:
pipenv run python -m analysis.common_crawl
- The Netherlands Institute for Sound and Vision (NIBG) analysis:
pipenv run python -m analysis.nibg_analysis
This uses the prebuilt aggregated statistics of the file types per month.
- The Data Archiving and Networked Services (DANS) analysis is still a work in progress.
The code in this repository is mostly "config-driven": a config.yaml file in the root of the repository determines which file formats are included in the analyses. You can adjust these settings to your liking.
This repository is installable with pip or Pipenv, because there is a setup.py installation script in the root of the project. The library contains a Python implementation of the Bass diffusion model, which you can use to generate data for plotting fitted diffusion curves.
You can install the library into your own project as follows:
cd my_experimentation_folder
pipenv install git+https://github.com/Antfield-Creations/NDE-monitoring-file-formats#egg=bass_diffusion
Once you have installed the library, you can use it in Python (remember to start the interpreter with pipenv run python first):
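from bass_diffusion import BassDiffusionModel
bass_model = BassDiffusionModel()
# Observation times and the value observed at each of those times
times = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
values = [200, 300, 600, 900, 800, 500, 300, 100, 50, 20]
# Fit the model to the observations
bass_model.fit(times, values)
# Predict values at time points between the observations
interpolated = bass_model.predict([0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5])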
The algorithm is very fast and can handle a lot of data. However, it is not very robust to noise.
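For readers unfamiliar with the Bass diffusion model, the following is a minimal, standalone sketch of its textbook closed-form equations. It is not the code of the bass_diffusion library, and the function names are made up for illustration. The parameters follow the usual convention: m is the total number of eventual adopters, p the coefficient of innovation, and q the coefficient of imitation. The example values for m, p and q are arbitrary assumptions, not results from this project.

```python
import math


def bass_cumulative_fraction(t: float, p: float, q: float) -> float:
    """Fraction of the eventual adopters that has adopted by time t."""
    decay = math.exp(-(p + q) * t)
    return (1.0 - decay) / (1.0 + (q / p) * decay)


def bass_adoption_rate(t: float, m: float, p: float, q: float) -> float:
    """Expected number of new adopters per unit of time at time t."""
    decay = math.exp(-(p + q) * t)
    return m * ((p + q) ** 2 / p) * decay / (1.0 + (q / p) * decay) ** 2


def bass_peak_time(p: float, q: float) -> float:
    """Time at which the adoption rate is highest (only meaningful if q > p)."""
    return math.log(q / p) / (p + q)


# Arbitrary illustration values (assumptions, not fitted to any real data)
m, p, q = 1000.0, 0.03, 0.38

# Adoption rate at ten evenly spaced time steps: rises to a peak, then declines
curve = [bass_adoption_rate(t, m, p, q) for t in range(10)]
peak = bass_peak_time(p, q)
```

The rise-and-fall shape of the adoption rate resembles the values in the usage example above, which is the kind of data a fitted model is meant to describe.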