<!doctype html>
|
|
<html lang="en">
|
|
<head>
|
|
<title>Distributed data logistics with DataLad</title>
|
|
<meta name="description" content="Talk at the FZJ IT-Forum">
|
|
<meta name="author" content="Michael Hanke">
|
|
|
|
<meta charset="utf-8">
|
|
<meta name="apple-mobile-web-app-capable" content="yes" />
|
|
<meta name="apple-mobile-web-app-status-bar-style" content="black-translucent" />
|
|
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no, minimal-ui">
|
|
<link rel="stylesheet" href="common/css/main.css" id="theme">
|
|
<script src="common/js/printpdf.js"></script>
|
|
</head>
|
|
<body>
|
|
|
|
<div class="reveal">
|
|
<div class="slides">
|
|
<section>
|
|
<h1>DataLad<br><small>Distributed data logistics</small></h1>
|
|
<p>Michael Hanke</p>
|
|
<p>
|
|
<small>Institute of Neuroscience and Medicine, Brain &amp; Behavior (INM-7),
|
|
Research Center Jülich</small><br>
|
|
<small>Institute of Systems Neuroscience, Medical Faculty, Heinrich Heine University Düsseldorf</small></p>
|
|
<p><img style="height:50px;margin-bottom:-12px;margin-right:10px" data-src="common/img/mastodon.svg" />@mih@mas.to
|
|
<a href="http://psychoinformatics.de">http://psychoinformatics.de</a></p>
|
|
<p style="margin-top:50px"><img style="height:100px;margin-right:100px" data-src="common/img/fzj_logo.svg" />
|
|
<img style="height:100px" data-src="common/img/hhu_logo.svg" /></p>
|
|
<a href="https://creativecommons.org/licenses/by/4.0">
|
|
<img data-src="img/cc-by.svg" />
|
|
</a>
|
|
</section>
|
|
|
|
<section data-markdown><script type="text/template">
|
|
<!-- .element: height="500" -->
|
|
|
|
- Free and open-source software (MIT)
|
|
- Continuously developed for 12 years, as an international collaboration
|
|
- Numerous topical (third-party) extension packages
|
|
|
|
https://helmholtz.software/software/datalad
|
|
|
|
<aside class="notes">
|
|
But let's not talk about it, and only talk about features and example implementations in DataLad
|
|
</aside>
|
|
</script>
|
|
</section>
|
|
|
|
<section>
|
|
<section data-markdown><script type="text/template">
|
|
## What can DataLad help with?
|
|
</script></section>
|
|
|
|
<section data-markdown><script type="text/template">
|
|
## Access an ecosystem of cyberinfrastructure
|
|

|
|
|
|
The vast majority of services are already covered, and support for more can be added easily through independent efforts.
|
|
</script></section>
|
|
|
|
<section data-markdown><script type="text/template">
|
|
## Remote-Process "cannot-move" Data
|
|

|
|
|
|
Enables utilization of data resources that cannot be handed out for legal, technical or other reasons.
|
|
</script></section>
|
|
|
|
<section data-markdown><script type="text/template">
|
|
## Reproducible HPC workflows
|
|

|
|
|
|
Enhances trust in computational outcomes through automatically verified reproducibility, even for users that have no access to the original compute resources.
|
|
|
|
<note>Wagner, Waite, Wierzba, Hoffstaedter, Waite, Poldrack, Eickhoff, Hanke (2022). FAIRly big: A framework for computationally reproducible processing of large-scale data. Scientific Data, 9, 80.</note>
|
|
</script></section>
|
|
|
|
<section data-markdown><script type="text/template">
|
|
## Reproducible publications
|
|
|
|
<iframe width="1120" height="630" src="https://www.youtube-nocookie.com/embed/nhLqmF58SLQ" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
|
|
|
|
- Oldest example: Peer-reviewed paper published in Behavior Research Methods in 2020<br>[[DOI 10.3758/s13428-020-01428-x](https://doi.org/10.3758/s13428-020-01428-x)]<!-- .element: style="font-size:70%" -->
|
|
|
|
- See http://handbook.datalad.org/r.html?reproducible-paper and https://youtube.com/datalad
|
|
|
|
<!-- .element: style="font-size:70%" -->
|
|
Note:
|
|
- VERY useful prior publication
|
|
</script></section>
|
|
|
|
<section data-markdown data-transition="none"><script type="text/template">
|
|
## Automated data catalogs
|
|
<!-- .element: style="width:49%" -->
|
|
<!-- .element: style="width:49%" -->
|
|
|
|
Improves (global) findability, populated from existing metadata
|
|
<note>Example: https://data.sfb1451.de</note>
|
|
</script></section>
|
|
</section>
|
|
|
|
<section>
|
|
<section data-markdown><script type="text/template">
|
|
## How does this work?
|
|
</script></section>
|
|
|
|
<section data-markdown data-transition="none"><script type="text/template">
|
|
## Exhaustive tracking of research components
|
|
<!-- .element: width="100%" -->
|
|
Well-structured datasets (using community standards), and portable computational environments — and their evolution — are the precondition for reproducibility
|
|
|
|
<table width=100% style="padding:0px">
|
|
<tr><td style="padding:0px">
|
|
<code><pre>
|
|
# turn any directory into a dataset
|
|
# with version control
|
|
|
|
% datalad create &lt;directory&gt;
|
|
</pre></code>
|
|
</td><td style="padding:0px">
|
|
<code><pre>
|
|
# save a new state of a dataset with
|
|
# file content of any size
|
|
|
|
% datalad save
|
|
</pre></code>
|
|
</td></tr></table>
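
The two commands above can be combined into a first round trip. This is a minimal sketch, not a prescribed workflow: the helper name `start_dataset`, the dataset location, and the README text are made-up placeholders.

```shell
# Sketch only: start_dataset and the README content are illustrative placeholders.
start_dataset() {
    ds="$1"
    # create a new, empty dataset at the given location
    datalad create "$ds" || return 1
    # add some content to the new dataset
    printf 'Notes on this project\n' > "$ds/README.md"
    # record the current content as a new dataset state
    datalad save -d "$ds" -m "Add README"
}

# usage: start_dataset my-analysis
```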
|
|
Note:
|
|
- link to prev. statements on description standards
|
|
- your community could be really small (your lab); when data are precious, resources will be spent to understand them, but information must be captured to make this possible
|
|
</script></section>
|
|
|
|
<section data-markdown data-transition="none"><script type="text/template">
|
|
## Capture computational provenance
|
|
<!-- .element: width="100%" -->
|
|
Which data was needed at which version, as input into which code, running with what parameterization in which
|
|
computational environment, to generate an outcome?
|
|
|
|
<table width=100% style="padding:0px">
|
|
<tr><td style="padding:0px">
|
|
<code><pre>
|
|
# execute any command and capture its output
|
|
# while recording all input versions too
|
|
|
|
% datalad run --input ... --output ... &lt;command&gt;
|
|
</pre></code>
|
|
</td></tr></table>
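
A filled-in call might look like the following sketch. The helper name `record_run`, the script `code/summarize.py`, and the file paths in the usage line are hypothetical; `{inputs}` and `{outputs}` are placeholders that `datalad run` expands to the declared paths.

```shell
# Sketch only: record_run and code/summarize.py are hypothetical names.
record_run() {
    input="$1"; output="$2"
    # datalad retrieves the declared input, runs the command, and saves
    # the outcome together with a machine-readable record of the run
    datalad run -m "Compute summary" \
        --input "$input" \
        --output "$output" \
        "python code/summarize.py {inputs} {outputs}"
}

# usage: record_run data/raw.csv results/summary.csv
```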
|
|
|
|
Note:
|
|
The missing link: even when everything is shared, we still don't know how to start.
|
|
README is minimum, but executable prov-records are much better.
|
|
</script></section>
|
|
|
|
<section data-markdown data-transition="none"><script type="text/template">
|
|
## Exhaustive capture enables portability
|
|
<!-- .element: width="100%" -->
|
|
Precise identification of data and computational environments, combined with provenance records, forms a comprehensive and portable data structure that captures all aspects of an investigation.
|
|
|
|
<table width=100% style="padding:0px">
|
|
<tr><td style="padding:0px">
|
|
<code><pre>
|
|
# transfer data and metadata to other sites and services
|
|
# with fine-grained access control for dataset components
|
|
|
|
% datalad push --to &lt;site-or-service&gt;
|
|
</pre></code>
|
|
</td></tr></table>
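
Before a push, the target has to be known to the dataset as a sibling. The sketch below assumes it is run inside a dataset; the helper name `publish_dataset`, the sibling name, and the URL in the usage line are placeholders, and access control itself is configured on the receiving service, not shown here.

```shell
# Sketch only: publish_dataset, the sibling name, and the URL are placeholders.
publish_dataset() {
    name="$1"; url="$2"
    # register the remote location under a short name (a "sibling")
    datalad siblings add -d . --name "$name" --url "$url" || return 1
    # transfer the dataset's recorded states to that sibling
    datalad push -d . --to "$name"
}

# usage: publish_dataset backup ssh://example.org/path/to/dataset
```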
|
|
|
|
Note:
|
|
Does it fly? Can you give it to someone? Or can you take it with you to your new lab?
|
|
</script></section>
|
|
|
|
<section data-markdown data-transition="none"><script type="text/template">
|
|
## Reproducibility strengthens trust
|
|
<!-- .element: width="100%" -->
|
|
Outcomes of computational transformations can be validated by authorized third parties. This enables audits, promotes accountability, and streamlines automated "upgrades" of outputs.
|
|
|
|
<table width=100% style="padding:0px">
|
|
<tr><td style="padding:0px">
|
|
<code><pre>
|
|
# obtain dataset (initially only identity,
|
|
# availability, and provenance metadata)
|
|
|
|
% datalad clone &lt;url&gt;
|
|
</pre></code>
|
|
</td><td style="padding:0px">
|
|
<code><pre>
|
|
# immediately actionable provenance records
|
|
# full abstraction of input data retrieval
|
|
|
|
% datalad rerun &lt;commit|tag|range&gt;
|
|
</pre></code>
|
|
</td></tr></table>
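
Chained together, the two steps give a third party a short path from a URL to a re-executed result. A minimal sketch, assuming the cloned dataset contains a run record at the given revision; the helper name `reproduce` and the values in the usage line are placeholders.

```shell
# Sketch only: reproduce, the URL, and the revision are placeholders.
reproduce() {
    url="$1"; dir="$2"; rev="$3"
    # obtain the dataset; file content is not downloaded at this point
    datalad clone "$url" "$dir" || return 1
    # re-execute the command recorded at the given revision,
    # from inside the freshly cloned dataset
    ( cd "$dir" && datalad rerun "$rev" )
}

# usage: reproduce https://example.org/study.git study v1.0
```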
|
|
Note:
|
|
The goal is automated reproducibility, which enables assessment of robustness and benchmarking of algorithmic developments
|
|
</script></section>
|
|
|
|
<section data-markdown data-transition="none"><script type="text/template">
|
|
## Ultimate goal: (re-)usability
|
|
<!-- .element: width="100%" -->
|
|
Verifiable, portable, self-contained data structures that track all aspects of an investigation exhaustively can be (re-)used as modular components in larger contexts — propagating their traits
|
|
|
|
<table width=100% style="padding:0px">
|
|
<tr><td style="padding:0px">
|
|
<code><pre>
|
|
# declare a dependency on another dataset and
|
|
# re-use it at a particular state in a new context
|
|
|
|
% datalad clone -d &lt;superdataset&gt; &lt;url&gt; &lt;path-in-dataset&gt;
|
|
</pre></code>
|
|
</td></tr></table>
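
As a sketch of that pattern, run from inside the dataset that should gain the dependency: the helper name `add_input_dataset`, the URL, and the target path in the usage line are placeholders.

```shell
# Sketch only: add_input_dataset, the URL, and the path are placeholders.
add_input_dataset() {
    url="$1"; path="$2"
    # clone the other dataset into the current one and register it
    # there as a subdataset at its current version
    datalad clone -d . "$url" "$path" || return 1
    # list the subdatasets now registered in the current dataset
    datalad subdatasets -d .
}

# usage: add_input_dataset https://example.org/rawdata.git inputs/raw
```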
|
|
|
|
Note:
|
|
With these in place, re-usability is a small(er) step
|
|
</script></section>
|
|
|
|
<section data-markdown><script type="text/template">
|
|
## DataLad: Manage (co-)evolution of digital objects
|
|
<!-- .element: width="900" style="margin-bottom:-70px;margin-top:-20px" -->
|
|
|
|
Consume, create, curate, analyze, publish, and query data with full provenance capture and "universal" metadata support.
|
|
<p style="font-size:70%;margin-top:-20px">
|
|
DataLad is free and open source (MIT-licensed). http://datalad.org
|
|
</p>
|
|
|
|
<note>
|
|
Halchenko, Meyer, Poldrack, ... & Hanke, M. (2021).
|
|
DataLad: distributed system for joint management of code, data, and their relationship.
|
|
Journal of Open Source Software, 6(63), 3262.
|
|
</note>
|
|
Note:
|
|
- following illustrations contain concrete implementation with datalad
|
|
- Software developed to address the needs of long-term maintenance and collaboration on the studyforrest dataset
|
|
</script></section>
|
|
|
|
<section data-markdown><script type="text/template">
|
|
## Talk is cheap, show me the code: Git vs. DataLad
|
|
|
|
<iframe width="1120" height="630" src="https://www.youtube-nocookie.com/embed/Yrg6DgOcbPE" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
|
|
|
|
https://www.youtube.com/watch?v=Yrg6DgOcbPE
|
|
|
|
<aside class="notes">
|
|
- show git limits: commit a change in a 3rd-level submodule
|
|
- show annex limits: get file in a subdataset
|
|
- reveal: datalad makes repo-boundaries vanish -- show save -r
|
|
</aside>
|
|
</script></section>
|
|
</section>
|
|
|
|
|
|
<section data-markdown><script type="text/template">
|
|
## Extensive documentation and training materials
|
|
<!-- .element: width="700" style="margin-top:-20px;margin-bottom:-10px" -->
|
|
|
|
https://handbook.datalad.org (or ISBN 979-8857037973)
|
|
|
|
- **educational materials** on technologies, **targeting researchers** rather than developers (executable paper, student supervisor workflow, ...)
|
|
- handbook on concepts, workflows, and use cases
|
|
- **weekly public (virtual) office hour**
|
|
|
|
Note:
|
|
RDM Education is key. Handbook helps people be more productive, yielding more FAIR resources as an outcome, but not as the main goal.
|
|
</script></section>
|
|
|
|
<section>
|
|
<section data-markdown data-transition="none"><script type="text/template">
|
|
## Machine-driven metadata reporting
|
|
|
|
<!-- .element: style="height:650px;margin-bottom:-30px" -->
|
|
|
|
Formal "open-world" model, query and validated submission<br>
|
|
RDF-compatible *and* simultaneously scripting-ready<br>
|
|
</script></section>
|
|
|
|
<section data-markdown data-transition="none"><script type="text/template">
|
|
## Manual annotation and description
|
|
|
|
<video data-autoplay width="1280" height="720" controls loop>
|
|
<source src="vid/annotate_demo.webm" type="video/webm">
|
|
</video>
|
|
|
|
Preview a live editor: https://annotate.trr379.de/s/demo
|
|
</script></section>
|
|
|
|
<section data-markdown><script type="text/template">
|
|
## Full-stack RDM solution
|
|

|
|
|
|
See https://atris.fz-juelich.de for an FZJ Forgejo-Aneksajo deployment
|
|
</script></section>
|
|
</section>
|
|
|
|
<section data-markdown><script type="text/template">
|
|

|
|
|
|
https://distribits.live
|
|
</script></section>
|
|
|
|
<section>
|
|
<h2>DataLad contact and more information</h2>
|
|
<table>
|
|
<tr><td>Website + Demos</td>
|
|
<td><a href="http://datalad.org">http://datalad.org</a></td>
|
|
</tr><tr><td>Documentation</td>
|
|
<td><a href="http://handbook.datalad.org">http://handbook.datalad.org</a></td>
|
|
</tr><tr><td>Talks and tutorials</td>
|
|
<td><a href="https://youtube.com/datalad">https://youtube.com/datalad</a></td>
|
|
</tr><tr><td>Development</td>
|
|
<td><a href="http://github.com/datalad">http://github.com/datalad</a></td>
|
|
</tr><tr><td>Support</td>
|
|
<td><a href="https://matrix.to/#/#datalad:matrix.org">https://matrix.to/#/#datalad:matrix.org</a></td>
|
|
</tr><tr><td>Open data</td>
|
|
<td><a href="http://datasets.datalad.org">http://datasets.datalad.org</a></td>
|
|
</tr>
<tr><td>Mastodon</td>
|
|
<td>@datalad@fosstodon.org</td>
|
|
</tr>
|
|
</table>
|
|
</section>
|
|
</div> <!-- /.slides -->
|
|
</div> <!-- /.reveal -->
|
|
|
|
<script src="common/reveal.js/js/reveal.js"></script>
|
|
|
|
<script>
|
|
// Full list of configuration options available at:
|
|
// https://github.com/hakimel/reveal.js#configuration
|
|
Reveal.initialize({
|
|
// The "normal" size of the presentation, aspect ratio will be preserved
|
|
// when the presentation is scaled to fit different resolutions. Can be
|
|
// specified using percentage units.
|
|
width: 1280,
|
|
height: 960,
|
|
|
|
// Factor of the display size that should remain empty around the content
|
|
margin: 0.1,
|
|
|
|
// Bounds for smallest/largest possible scale to apply to content
|
|
minScale: 0.2,
|
|
maxScale: 1.0,
|
|
|
|
controls: true,
|
|
progress: true,
|
|
history: true,
|
|
center: true,
|
|
|
|
transition: 'slide', // none/fade/slide/convex/concave/zoom
|
|
|
|
// Optional reveal.js plugins
|
|
dependencies: [
|
|
{ src: 'common/reveal.js/plugin/highlight/highlight.js', async: true, condition: function() { return !!document.querySelector( 'pre code' ); }, callback: function() { hljs.initHighlightingOnLoad(); } },
|
|
{ src: 'common/reveal.js/plugin/markdown/marked.js', condition: function() { return !!document.querySelector( '[data-markdown]' ); } },
|
|
{ src: 'common/reveal.js/plugin/markdown/markdown.js', condition: function() { return !!document.querySelector( '[data-markdown]' ); } },
|
|
{ src: 'common/reveal.js/plugin/zoom-js/zoom.js', async: true },
|
|
{ src: 'common/reveal.js/plugin/notes/notes.js', async: true }
|
|
]
|
|
});
|
|
</script>
|
|
</body>
|
|
</html>
|