<!doctype html>
|
|
<html lang="en">
|
|
<head>
|
|
<title>DataLad beyond Git</title>
|
|
<meta name="description" content="DataLad has been built on Git and git-annex as foundational pillars. However, the vast majority of data infrastructures are not Git-aware. Git-annex can work with a much broader array of services, but the need to 'keep the Git repo somewhere' imposes undesirable technical and procedural complexity on users. In this talk I illustrate existing means to take Git-based DataLad datasets to places that Git cannot reach on its own. Moreover, I introduce ongoing work that aims to enable DataLad users to consume non-DataLad resources as native DataLad datasets, and non-DataLad users to consume DataLad resources without DataLad, git-annex, or even Git.">
|
|
<meta name="author" content="Michael Hanke">
|
|
|
|
<meta charset="utf-8">
|
|
<meta name="apple-mobile-web-app-capable" content="yes" />
|
|
<meta name="apple-mobile-web-app-status-bar-style" content="black-translucent" />
|
|
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no, minimal-ui">
|
|
<link rel="stylesheet" href="common/css/main.css" id="theme">
|
|
<script src="common/js/printpdf.js"></script>
|
|
</head>
|
|
<body>
|
|
|
|
<div class="reveal">
|
|
<div class="slides">
|
|
<section>
|
|
<h1>DataLad beyond Git<br><small>Connecting to the rest of the world</small></h1>
|
|
<p>Michael Hanke</p>
|
|
<p>
|
|
<small>Institute of Neuroscience and Medicine, Brain & Behavior (INM-7),
|
|
Research Center Jülich</small><br>
|
|
<small>Institute of Systems Neuroscience, Medical Faculty, Heinrich Heine University Düsseldorf</small><br>
|
|
<p><img style="height:50px;margin-bottom:-12px;margin-right:10px" data-src="common/img/mastodon.svg" />@mih@mas.to
|
|
<a href="http://psychoinformatics.de">http://psychoinformatics.de</a></p>
|
|
<p style="margin-top:50px"><img style="height:100px;margin-right:100px" data-src="common/img/fzj_logo.svg" />
|
|
<img style="height:100px" data-src="common/img/hhu_logo.svg" /></p>
|
|
</section>
|
|
|
|
<section>
|
|
<h2>Acknowledgements</h2>
|
|
<table>
|
|
<tr style="vertical-align:middle">
|
|
<td style="vertical-align:middle">
|
|
<dl style="margin-bottom:20px">
|
|
<dt style="margin-top:20px">DataLad software <br>
|
|
& ecosystem</dt>
|
|
<dd style="margin-left:5px!important">
|
|
<ul style="margin-left:5px!important">
|
|
<li>Psychoinformatics Lab, <br>
|
|
Research Center Jülich</li>
|
|
<li>Center for Open <br>
|
|
Neuroscience, <br>
|
|
Dartmouth College</li>
|
|
<li>Joey Hess (git-annex)</li>
|
|
<li><em>>100 additional contributors</em></li>
|
|
</ul>
|
|
</dd>
|
|
</dl>
|
|
</td>
|
|
<td style="vertical-align:middle">
|
|
<div style="margin-top:-20px;margin-bottom:-20px;text-align:center"><strong>Funders</strong></div>
|
|
<div style="margin-top:-20px">
|
|
<img style="height:150px;margin-right:0px" data-src="common/img/nsf.png" />
|
|
<img style="height:150px;margin-right:0px;margin-left:40px" data-src="common/img/binc.png" />
|
|
<img style="height:150px;margin-left:0px" data-src="common/img/bmbf_datalad.png" />
|
|
</div>
|
|
<div style="margin-top:-20px">
|
|
<img style="height:80px;margin-top:0px;margin-left:30px" data-src="common/img/fzj_logo.svg" />
|
|
<img style="height:60px;margin-left:50px;margin-bottom:25px" data-src="common/img/dfg_logo.png" />
|
|
</div>
|
|
<div style="margin-top:-20px">
|
|
<img style="height:100px" data-src="common/img/erc_logo.png" />
|
|
<img style="height:60px;margin-bottom:35px" data-src="common/img/erdf.png" />
|
|
</div>
|
|
<div style="margin-top:-20px">
|
|
<img style="height:80px;margin-right:20px;margin-bottom:5px" data-src="common/img/nrw_mkw_logo.png" />
|
|
<img style="height:60px;margin-right:20px" data-src="common/img/cbbs_logo.png" />
|
|
<img style="height:60px" data-src="common/img/LSA-Logo.png" />
|
|
</div>
|
|
</td>
|
|
</tr>
|
|
<tr>
|
|
<td colspan=2 width="100%">
|
|
<div style="margin-top:0px">
|
|
<div style="margin-top:20px;margin-bottom:-50px"><strong>Collaborators</strong></div>
|
|
<img style="height:100px;margin:0px;margin-left:100px" data-src="common/img/cbrain_logo.png" />
|
|
<img style="height:100px;margin:20px" data-src="common/img/hbp_logo.png" />
|
|
<img style="height:100px;margin:20px" data-src="common/img/conp_logo.png" />
|
|
<img style="height:120px;margin:10px" data-src="common/img/openneuro_logo.png" />
|
|
<img style="height:100px;margin:20px" data-src="common/img/ebrain-health-logo.png"/>
|
|
<img style="height:100px;margin:20px" data-src="common/img/GIN_logo.png" />
|
|
</div>
|
|
<div style="margin-top:-20px;text-align:center">
|
|
<img style="height:120px;margin:20px" data-src="common/img/sfb1451_logo.png" />
|
|
<img style="height:140px;margin:10px" data-src="common/img/brainlife_logo.png" />
|
|
<img style="height:100px;margin:20px" data-src="common/img/vbc_logo.png" />
|
|
</div>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
</section>
|
|
|
|
<section data-markdown data-transition="none"><script type="text/template">
|
|
## Mindset: Everything is distributed
|
|
|
|
|
|
Resources, people, expertise, services
|
|
|
|
<!-- .element: height="400" -->
|
|
<!-- .element: style="float:right" -->
|
|
<div style="width:800px">
|
|
|
|
- **Version-control** is an organizer/safety wrapper around processes and people (including self)
|
|
- Progress requires **collaboration** with an ever-changing group of people, across different locations
|
|
- Success is an incremental and **sustainable achievement** built on a trustworthy foundation
|
|
|
|
</div>
|
|
|
|
*DataLad is a productivity tool for a distributed world*
|
|
|
|
(inspired by the free software movement)
|
|
</script></section>
|
|
|
|
<section data-markdown data-transition="none"><script type="text/template">
|
|
## DataLad "world model"
|
|

|
|
|
|
DataLad is an orchestrator for Git and git-annex
|
|
</script></section>
|
|
|
|
<section>
|
|
<section data-markdown data-transition="none"><script type="text/template">
|
|
## Questions to answer when going distributed
|
|
|
|
<!-- .element: style="margin-top:100px" height="400" -->
|
|
<!-- .element: style="float:right" -->
|
|
<div style="width:900px">
|
|
|
|
- Collaborating or depositing? Are updates expected? From whom? How are they provided?

- Git stuff (code, metadata):
  - Where can it live? Who can have it?
  - Is a service required/desired for collaboration assistance or visibility?

- Data stuff (large and/or binary blobs):
  - Too big to be everywhere?
  - Is the target audience identical to that of the Git stuff?
  - Does it evolve?
|
|
|
|
</div>
|
|
</script></section>
|
|
|
|
<section data-markdown data-transition="none"><script type="text/template">
|
|
## Solutions for joint hosting
|
|
|
|
Git repo and data hosted at the same location/service
|
|
|
|
- Self-hosted Git repos with annex (related: DataLad RIA store)
- Git hosting with Git-LFS
- Git hosting with built-in annex support (GIN)

*Joint hosting is attractive (low complexity)<br> and possible at any location that Git can reach.*
|
|
</script></section>
|
|
|
|
<section data-markdown data-transition="none"><script type="text/template">
|
|
## Git is a constraint
|
|
|
|
- Issues
  - Most cloud storage is excluded
  - Many institutions implement Git-is-for-code services only
  - Git-LFS is unsatisfactory when data deletion is (frequently) necessary
  - `git-annex export` is single-version and withholds the advantages of a distributed VCS from consumers

- Solution: Compound hosting
  - Git repo hosted on Git-aware/compatible infrastructure (e.g. GitHub for reach)
  - Data hosted anywhere (cheap enough, large enough, safe enough)
  - Benefit: access managed separately for data vs metadata (think personal data)
|
|
</script></section>
|
|
|
|
<section data-markdown data-transition="none"><script type="text/template">
|
|
## Good enough?
|
|
|
|
<!-- .element: style="margin-top:50px;float:right" height="400" -->
|
|
<div style="width:900px;text-align:left">
|
|
|
|
No!
|
|
|
|
- Minimum of two separate systems to support/maintain
|
|
- Often different authorities
|
|
- Need to get another service approved for use
|
|
|
|
But the benefit of separate access management?
|
|
|
|
- When target audiences for data and metadata are identical, there is no benefit
|
|
|
|
</div>
|
|
</script></section>
|
|
</section>
|
|
|
|
<section>
|
|
<section data-markdown data-transition="none"><script type="text/template">
|
|
## Git remote helper: datalad-annex
|
|
|
|
- Deposit a Git repo via git-annex (possibly inside another annex)
|
|
|
|
- Establish git-annex remote as common interface for Git or git-annex data transport
|
|
|
|
<!-- .element: style="margin-top:100px" height="400" -->
|
|
<!-- .element: style="float:right" -->
|
|
|
|
<div style="width:900px">
|
|
|
|
|
|
- Idea:

  - Represent a Git remote as two annex keys

    1. Plain-text list of `refs`
    2. Zipped, bare Git repo with the refs

  - Use a custom git-annex key backend (XDLRA) to bypass any content verification and selectively employ git-annex remotes for transport

  - Key names are *not* content-based ↷<br> one deposit per unique remote setup/annex
|
|
</div>
|
|
</script></section>
|
|
|
|
<section data-markdown data-transition="none"><script type="text/template">
|
|
## datalad-annex internals
|
|

|
|
|
|
- `git-fetch`: check `refs`, copy `repo-export`, unpack, fetch
|
|
- `git-push`: fetch, push, pack, copy `repo-export`, update `refs`
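
As a rough illustration of the two sequences above, the following toy model treats the remote as a plain dictionary holding the two keys. It is a sketch under assumptions, not the datalad-annex implementation: the `XDLRA--refs` key name, the `push`/`fetch` helpers, and the rule of skipping the download when the refs list is unchanged are all hypothetical.

```python
from typing import Optional, Tuple

REFS_KEY = "XDLRA--refs"         # assumed name, by analogy with the key below
REPO_KEY = "XDLRA--repo-export"  # zipped bare repository


def push(remote: dict, refs: str, repo_archive: bytes) -> None:
    """Deposit the archive first, then update the refs list."""
    remote[REPO_KEY] = repo_archive
    remote[REFS_KEY] = refs


def fetch(remote: dict, known_refs: Optional[str]) -> Tuple[Optional[str], Optional[bytes]]:
    """Check the refs first; copy the archive only if they changed.

    Returns (refs, archive). The archive is None when the remote is
    empty or when its refs match what the caller already has.
    """
    refs = remote.get(REFS_KEY)
    if refs is None or refs == known_refs:
        return refs, None
    return refs, remote[REPO_KEY]


remote = {}
push(remote, "ab34ef11 refs/heads/main\n", b"<zip bytes>")
refs, archive = fetch(remote, known_refs=None)  # first fetch copies the archive
_, again = fetch(remote, known_refs=refs)       # nothing changed, nothing copied
```

Because the key names are fixed rather than derived from content, every push overwrites the same two objects, which matches the "one deposit per unique remote setup/annex" point made earlier.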
|
|
</script></section>
|
|
|
|
<section data-markdown data-transition="none"><script type="text/template">
|
|
## datalad-annex examples
|
|
|
|
- URL-encode full `initremote` parameter list
|
|
- Parameter expansion support for URL components.
|
|
|
|
<div style="text-align:left">
|
|
|
|
Public S3 bucket (export)
|
|
|
|
```
datalad-annex::?type=S3&encryption=none&bucket=<BUCKET>&exporttree=yes&public=yes
```
|
|
|
|
Dataverse dataset (by DOI with annex object tree)
|
|
|
|
```
datalad-annex::?type=external&externaltype=dataverse&encryption=none&exporttree=no 🢱
    &url=https%3A//demo.dataverse.org&doi=doi:10.70122/MYT/ESTDOI
```
|
|
|
|
Zipped repo at (localhost) `/tmp/XDLRA--repo-export`
|
|
```
datalad-annex::file:///tmp?type=external&externaltype=uncurl&encryption=none 🢱
    &url={noquery}/{{annex_key}}

datalad-annex::ssh://localhost/tmp?type=external&externaltype=uncurl& 🢱
    encryption=none&url={noquery}/{{annex_key}}
```
|
|
|
|
Zipped Git repo at https://example.com/.datalad/dotgit/repo.zip
|
|
|
|
```
datalad-annex::https://example.com
```
|
|
|
|
</div>
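
The URL-encoding of an `initremote` parameter list can be sketched with the Python standard library. The helper names `datalad_annex_url` and `initremote_params` are hypothetical and not part of `datalad-next`; the sketch only shows how literal parameter values end up in the query string.

```python
from urllib.parse import parse_qsl, quote, urlencode


def datalad_annex_url(params: dict, base: str = "") -> str:
    """Build a `datalad-annex::` URL from initremote-style parameters.

    `base` is an optional URL prefix (e.g. "https://example.com").
    Values are percent-encoded, but "/" is kept readable.
    """
    query = urlencode(params, safe="/", quote_via=quote)
    return f"datalad-annex::{base}?{query}"


def initremote_params(url: str) -> dict:
    """Recover the parameters from the query part of such a URL."""
    _, _, query = url.partition("?")
    return dict(parse_qsl(query))


params = {
    "type": "external",
    "externaltype": "dataverse",
    "encryption": "none",
    "exporttree": "no",
    "url": "https://demo.dataverse.org",
}
url = datalad_annex_url(params)
print(url)
# datalad-annex::?type=external&externaltype=dataverse&encryption=none&exporttree=no&url=https%3A//demo.dataverse.org
```

This helper percent-encodes braces as well, so it only suits literal values; placeholders such as `{noquery}` would have to be added to the URL unencoded.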
|
|
</script></section>
|
|
|
|
<section data-markdown data-transition="none"><script type="text/template">
|
|
## datalad-annex summary
|
|
|
|
- **Not** a high-performance collaboration utility for centralized contributor workflows
|
|
|
|
- **But** a flexible repository deposition helper: one provider, many consumers
|
|
|
|
- Confirmed to work with git-annex v8.20211123 or later,<br>
|
|
should work with any annex remote implementation
|
|
|
|
- Available from `datalad-next` extension package
|
|
|
|
<note>
|
|
http://docs.datalad.org/projects/next/en/latest/generated/datalad_next.gitremotes.datalad_annex.html
|
|
</note>
|
|
</script></section>
|
|
</section>
|
|
|
|
<section>
|
|
<section data-markdown data-transition="none"><script type="text/template">
|
|
## Longevity: DataLad is a liability
|
|

|
|
<note>https://social.sciences.re/@zimoun/112036749331120124</note>
|
|
</script></section>
|
|
|
|
<section data-markdown data-transition="none"><script type="text/template">
|
|
## And while we are at it: Git and git-annex too
|
|
|
|
<!-- .element: height="400" -->
|
|
<!-- .element: style="float:right" -->
|
|
|
|
<div style="text-align:left;width:900px">
|
|
|
|
- Imagine finding a CVS repository from 1994 with pointers to data (on tapes) in some language documented in a TROFF-formatted manual...

- Imagine finding a "fast-exported" git-annex repository in 2054 with pointers to data (stored at something reachable by software from 2024)...

- ... two archeology projects
|
|
|
|
</div>
|
|
|
|
<div style="margin-top:100px">
|
|
|
|
**Data preservation demands a data-description optimized record,<br>
not a data-use optimized record.**
|
|
|
|
*But still, we want and need to use today's data, today, with today's tech.*
|
|
</div>
|
|
</script></section>
|
|
|
|
<section data-markdown data-transition="none"><script type="text/template">
|
|
## Future of DataLad: Use only while you benefit
|
|

|
|
|
|
Convert DataLad datasets to/from a variety of metadata standards.
|
|
</script></section>
|
|
|
|
<section data-markdown data-transition="none"><script type="text/template">
|
|
## Metadata schema for data-distributions
|
|
|
|
<div style="font-size:90%">
|
|
|
|
- A schema, **not a new ontology/vocabulary**
|
|
|
|
- Semantics comprehensively defined, **RDF-serialization supported**
|
|
|
|
- **Built on W3C PROV-O and DCAT** (embracing ODRL)
|
|
|
|
- Developed with `linkml`<br>
|
|
(https://linkml.io; generate OWL, SHACL, ... as needed)
|
|
|
|
- Able to capture multi-version DataLad datasets with redundant availability
|
|
|
|
- **Key ideas**
|
|
|
|
- Primary subject is file content (`DCAT:Distribution`)
|
|
|
|
- Open-world attitude: provides key structural elements, but does not prescribe or limit to a particular domain/vocabulary
|
|
|
|
- Almost everything has a globally unique identifier
|
|
|
|
- Facilitates version-on-read/export metadata workflows
|
|
|
|
</div>
|
|
|
|
<note>Work in progress at: https://concepts.datalad.org</note>
|
|
</script></section>
|
|
|
|
<section data-markdown data-transition="none"><script type="text/template">
|
|
#### Example
|
|
|
|
<div style="font-size:70%">
|
|
|
|
Stored metadata:
|
|
|
|
```yaml
id: exthisdsver:./some/path.ext
is_distribution_of: exthisdsver:#some/path
relation:
  - id: exthisdsver:#some/path
    meta_type: dldist:Resource
    description: Some tabular data
    is_part_of: exthisdsver:#
  - id: exthisdsver:#
    meta_type: dldist:Resource
    description: A version of a collection of some data
    is_version_of: exthisds:#
  - id: exthisds:#
    meta_type: dldist:Resource
    description: A collection of some data
```
|
|
|
|
Reported metadata:
|
|
|
|
```
> linkml-convert -s <schema> -t ttl ↷
    -P exthisdsver=gitsha:ab34ef11/ -P exthisds=datalad-ds:8d90dce0-f197-11ee-8620-7b745c583563/ data.yaml

@prefix dldist: <https://concepts.datalad.org/s/distribution/unreleased/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<gitsha:ab34ef11/./some/path.ext> dldist:is_distribution_of <gitsha:ab34ef11/#some/path> ;
    dldist:meta_type "dldist:Distribution"^^xsd:anyURI ;
    dldist:relation <datalad-ds:8d90dce0-f197-11ee-8620-7b745c583563/#>,
        <gitsha:ab34ef11/#>,
        <gitsha:ab34ef11/#some/path> .

<datalad-ds:8d90dce0-f197-11ee-8620-7b745c583563/#> dldist:description "A collection of some data" ;
    dldist:meta_type "dldist:Resource"^^xsd:anyURI .

<gitsha:ab34ef11/#> dldist:description "A version of a collection of some data" ;
    dldist:is_version_of <datalad-ds:8d90dce0-f197-11ee-8620-7b745c583563/#> ;
    dldist:meta_type "dldist:Resource"^^xsd:anyURI .

<gitsha:ab34ef11/#some/path> dldist:description "Some tabular data" ;
    dldist:is_part_of <gitsha:ab34ef11/#> ;
    dldist:meta_type "dldist:Resource"^^xsd:anyURI .
```
|
|
|
|
Examples online for: Git blob/tree/commit, annex key/remote, DataLad dataset, publication, study, subject, research topic, instrument, data type, funding, agent/entity roles, ...
|
|
</div>
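
The step from stored to reported identifiers in this example is, at its core, prefix expansion. A minimal sketch, assuming a hypothetical `expand` helper (this is not how `linkml-convert` works internally), with the prefix map taken from the `-P` options of the command shown:

```python
PREFIXES = {
    "exthisdsver": "gitsha:ab34ef11/",
    "exthisds": "datalad-ds:8d90dce0-f197-11ee-8620-7b745c583563/",
}


def expand(identifier: str, prefixes: dict) -> str:
    """Replace a known prefix with its namespace; leave anything else as is."""
    prefix, sep, local = identifier.partition(":")
    if sep and prefix in prefixes:
        return prefixes[prefix] + local
    return identifier


for stored in ("exthisdsver:./some/path.ext", "exthisdsver:#", "exthisds:#"):
    print(stored, "->", expand(stored, PREFIXES))
# exthisdsver:./some/path.ext -> gitsha:ab34ef11/./some/path.ext
# exthisdsver:# -> gitsha:ab34ef11/#
# exthisds:# -> datalad-ds:8d90dce0-f197-11ee-8620-7b745c583563/#
```

The stored record never names a concrete version. Supplying a different `exthisdsver` namespace at read time would report the same record against another commit.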
|
|
|
|
</script></section>
|
|
|
|
|
|
</section>
|
|
|
|
<section data-markdown data-transition="none"><script type="text/template">
|
|
## The future
|
|
|
|
<!-- .element: height="400" style="margin-top:200px;float:right" -->
|
|
|
|
<div style="text-align:left;width:900px">
|
|
|
|
- Reduction to **only two primary outward-facing "APIs"**
|
|
|
|
- A versioned metadata schema<br>
|
|
(any tooling: JSON-LD, RDF, YAML, ...)
|
|
|
|
- git-annex external remote protocol<br>
|
|
(rely on, or provide a suitable implementation for a particular data store)
|
|
|
|
- **Reduced requirements** for "optimal" dataset hosting
|
|
|
|
- Stores files/objects
|
|
|
|
- (Optionally) accepts file/object metadata<br>
|
|
(for search/discoverability)
|
|
|
|
- Continued compatibility with Git/git-annex repositories, but this format will be a **choice** for particular use cases, and certain workflows
|
|
|
|
</div>
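
To give a feel for how small the second "API" is, here is a sketch of the core request/reply exchange of the git-annex external special remote protocol, backed by a local directory. The directory store and the `handle`/`serve` names are assumptions for illustration; a real remote would need more (e.g. atomic writes, `EXTENSIONS`, cost and availability replies).

```python
import os
import shutil
import sys


def handle(request: str, store_dir: str) -> str:
    """Map one request line from git-annex to one reply line."""
    parts = request.split(" ", 3)  # the file name may contain spaces
    cmd = parts[0]
    if cmd == "INITREMOTE":
        os.makedirs(store_dir, exist_ok=True)
        return "INITREMOTE-SUCCESS"
    if cmd == "PREPARE":
        return "PREPARE-SUCCESS"
    if cmd == "TRANSFER" and len(parts) == 4:
        direction, key, path = parts[1:]
        stored = os.path.join(store_dir, key)
        if direction not in ("STORE", "RETRIEVE"):
            return "UNSUPPORTED-REQUEST"
        try:
            if direction == "STORE":
                shutil.copyfile(path, stored)
            else:
                shutil.copyfile(stored, path)
        except OSError as exc:
            return f"TRANSFER-FAILURE {direction} {key} {exc}"
        return f"TRANSFER-SUCCESS {direction} {key}"
    if cmd == "CHECKPRESENT" and len(parts) == 2:
        key = parts[1]
        found = os.path.exists(os.path.join(store_dir, key))
        return f"CHECKPRESENT-{'SUCCESS' if found else 'FAILURE'} {key}"
    if cmd == "REMOVE" and len(parts) == 2:
        key = parts[1]
        try:
            os.remove(os.path.join(store_dir, key))
        except FileNotFoundError:
            pass  # an already absent key still counts as removed
        except OSError as exc:
            return f"REMOVE-FAILURE {key} {exc}"
        return f"REMOVE-SUCCESS {key}"
    return "UNSUPPORTED-REQUEST"


def serve(store_dir: str, stdin=sys.stdin, stdout=sys.stdout) -> None:
    """Announce the protocol version, then answer requests line by line."""
    print("VERSION 1", file=stdout, flush=True)
    for line in stdin:
        print(handle(line.rstrip("\n"), store_dir), file=stdout, flush=True)
```

Saved as an executable named `git-annex-remote-<name>` that calls `serve(...)`, a program of this shape is what `type=external&externaltype=<name>` refers to in the URLs shown earlier.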
|
|
|
|
**Will git-annex special remotes learn the concept of object metadata?**
|
|
</script></section>
|
|
|
|
<section>
|
|
<h2>DataLad contact and more information</h2>
|
|
<table>
|
|
<tr><td>Website</td>
|
|
<td><a href="https://datalad.org">https://datalad.org</a></td>
|
|
</tr><tr><td>Documentation</td>
|
|
<td><a href="https://handbook.datalad.org">https://handbook.datalad.org</a></td>
|
|
</tr><tr><td>Talks and tutorials</td>
|
|
<td><a href="https://youtube.com/datalad">https://youtube.com/datalad</a></td>
|
|
</tr><tr><td>Development</td>
|
|
<td><a href="https://github.com/datalad">https://github.com/datalad</a></td>
|
|
</tr><tr><td>Support</td>
|
|
<td><a href="https://matrix.to/#/#datalad:matrix.org">https://matrix.to/#/#datalad:matrix.org</a></td>
|
|
</tr><tr><td>Schema development</td>
|
|
<td><a href="https://concepts.datalad.org">https://concepts.datalad.org</a></td>
|
|
</tr>
<tr><td>Social media</td>
<td>@datalad@fosstodon.org</td>
</tr>
|
|
</table>
|
|
</section>
|
|
|
|
|
|
|
|
</div> <!-- /.slides -->
|
|
</div> <!-- /.reveal -->
|
|
|
|
<script src="common/reveal.js/js/reveal.js"></script>
|
|
|
|
<script>
|
|
// Full list of configuration options available at:
|
|
// https://github.com/hakimel/reveal.js#configuration
|
|
Reveal.initialize({
|
|
// The "normal" size of the presentation, aspect ratio will be preserved
|
|
// when the presentation is scaled to fit different resolutions. Can be
|
|
// specified using percentage units.
|
|
width: 1280,
|
|
height: 960,
|
|
|
|
// Factor of the display size that should remain empty around the content
|
|
margin: 0.1,
|
|
|
|
// Bounds for smallest/largest possible scale to apply to content
|
|
minScale: 0.2,
|
|
maxScale: 1.0,
|
|
|
|
controls: true,
|
|
progress: true,
|
|
history: true,
|
|
center: true,
|
|
|
|
transition: 'slide', // none/fade/slide/convex/concave/zoom
|
|
|
|
// Optional reveal.js plugins
|
|
dependencies: [
|
|
{ src: 'common/reveal.js/plugin/highlight/highlight.js', async: true, condition: function() { return !!document.querySelector( 'pre code' ); }, callback: function() { hljs.initHighlightingOnLoad(); } },
|
|
{ src: 'common/reveal.js/plugin/markdown/marked.js', condition: function() { return !!document.querySelector( '[data-markdown]' ); } },
|
|
{ src: 'common/reveal.js/plugin/markdown/markdown.js', condition: function() { return !!document.querySelector( '[data-markdown]' ); } },
|
|
{ src: 'common/reveal.js/plugin/zoom-js/zoom.js', async: true },
|
|
{ src: 'common/reveal.js/plugin/notes/notes.js', async: true }
|
|
]
|
|
});
|
|
</script>
|
|
</body>
|
|
</html>
|