MatbenchBenchmark

Bases: MSONable, MSONable2File

The core class for benchmarking with Matbench.

MatbenchBenchmark is capable of benchmarking and validating arbitrary materials science benchmarks. It is a container class for sets of MatbenchTasks, objects which provide predetermined sets of training/validation and testing data for any algorithm to benchmark with. MatbenchBenchmark can also give summaries of entire complex benchmarks, including access to individual score statistics for each metric.

MatbenchBenchmark can run any benchmark as long as it has a corresponding benchmark name key. Matbench v0.1 ("matbench_v0.1") is the only benchmark currently configured for use with MatbenchBenchmark.

MatbenchBenchmark is capable of running benchmark subsets; for example, only 3 of the 13 available Matbench v0.1 problems.

See the documentation for more details.
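
For orientation, here is a minimal workflow sketch. The construction, task attribute access, add_metadata, and serialization calls are documented on this page; the per-fold MatbenchTask calls (folds, get_train_and_val_data, get_test_data, record), the to_file helper from MSONable2File, and my_train_and_predict are assumptions or placeholders, not defined in this section.

    from matbench.bench import MatbenchBenchmark

    # Run a one-task subset; omit `subset` to run the full matbench_v0.1 benchmark.
    mb = MatbenchBenchmark(benchmark="matbench_v0.1", subset=["matbench_dielectric"])

    for task in mb.tasks:
        task.load()  # download/load this task's dataset
        for fold in task.folds:  # per-fold MatbenchTask API, assumed from the wider matbench workflow
            train_inputs, train_outputs = task.get_train_and_val_data(fold)
            test_inputs = task.get_test_data(fold, include_target=False)
            predictions = my_train_and_predict(train_inputs, train_outputs, test_inputs)  # hypothetical model function
            task.record(fold, predictions)

    mb.add_metadata({"algorithm": "my_model"})  # optional freeform metadata
    mb.to_file("results.json.gz")  # file output via the MSONable2File base (assumed)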

Attributes:

- benchmark_name (str): The benchmark name, defaults to the original Matbench v0.1 "matbench_v0.1". Should have an associated validation file in order for the MatbenchTasks to work correctly.
- metadata (dict): The corresponding metadata file for this benchmark, which defines the basic configuration for each task. See matbench_v0.1_validation for an example. Each dataset has the same required keys in order to work correctly.
- user_metadata (dict): Any metadata about the algorithm or benchmark that the user wants to keep as part of the benchmark file.
- tasks_map ({str: MatbenchTask}): A mapping of task name to the corresponding MatbenchTask object.
- <<task_names>> (MatbenchTask): Access any task object as an attribute, MatbenchBenchmark.<<task_name>>. For example:

    mb = MatbenchBenchmark()
    mb.matbench_dielectric

    <<MatbenchTask object>>

Source code in matbench/bench.py
class MatbenchBenchmark(MSONable, MSONable2File):
    """The core class for benchmarking with Matbench.

    MatbenchBenchmark is capable of benchmarking and validating arbitrary
    materials science benchmarks. It is a container class for sets of
    MatbenchTasks, objects which provide predetermined sets of
    training/validation and testing data for any algorithm to benchmark
    with. MatbenchBenchmark can also give summaries of entire complex
    benchmarks, including access to individual score statistics for
    each metric.

    MatbenchBenchmark can run any benchmark as long as it has a corresponding
    benchmark name key. Matbench v0.1 ("matbench_v0.1") is the only benchmark
    currently configured for use with MatbenchBenchmark.

    MatbenchBenchmark is capable of running benchmark subsets; for example,
    only 3 of the 13 available Matbench v0.1 problems.

    See the documentation for more details.

    Attributes:
        benchmark_name (str): The benchmark name, defaults to the original
            Matbench v0.1 "matbench_v0.1". Should have an associated
            validation file in order for the MatbenchTasks to work
            correctly.
        metadata (dict): The corresponding metadata file for this benchmark,
            which defines the basic configuration for each task. See
            matbench_v0.1_validation for an example. Each dataset
            has the same required keys in order to work correctly.
        user_metadata (dict): Any metadata about the algorithm or benchmark
            that the user wants to keep as part of the benchmark file.
        tasks_map ({str: MatbenchTask}): A mapping of task name to the
            corresponding MatbenchTask object.

        <<task_names>> (MatbenchTask): Access any task obj via
            MatbenchTask.<<task_name>>. For example:

            mb = MatbenchBenchmark()
            mb.matbench_dielectric

            <<MatbenchTask object>>
    """

    # For serialization
    _VERSION_KEY = "version"
    _BENCHMARK_KEY = "benchmark_name"
    _USER_METADATA_KEY = "user_metadata"
    _TASKS_KEY = "tasks"
    _DATESTAMP_KEY = "datestamp"
    _DATESTAMP_FMT = "%Y.%m.%d %H:%M.%S"
    _HASH_KEY = "hash"

    # For class usage
    ALL_KEY = "all"

    def __init__(self, benchmark=MBV01_KEY, autoload=False, subset=None):
        """

        Args:
            benchmark (str): The name of the benchmark. Only supported benchmark
                currently is "matbench_v0.1", though more will be added in the
                future.
            autoload (bool): If True, automatically load the dataset into memory
                For a full benchmark, this can take some time. If False, you'll
                need to load each task with .load before you can access the raw
                data.
            subset ([str]): A list of task names to use as a subset of a full
                benchmark. Only the named tasks will be contained in the class.
                Must correspond to the metadata file defined by the benchmark
                name.
        """

        if benchmark == MBV01_KEY:
            self.benchmark_name = MBV01_KEY
            self.metadata = mbv01_metadata
        else:
            raise ValueError(
                f"Only '{MBV01_KEY}' available. No other benchmarks defined!"
            )

        if subset:
            not_datasets = [k for k in subset if k not in self.metadata]
            if not_datasets:
                raise KeyError(
                    f"Some tasks in {subset} are not benchmark="
                    f"'{self.benchmark_name}' datasets! Remove {not_datasets}."
                )
            else:
                available_tasks = subset
        else:
            available_tasks = self.metadata.keys()

        self.user_metadata = {}
        self.tasks_map = RecursiveDotDict()

        for ds in available_tasks:
            self.tasks_map[ds] = MatbenchTask(
                ds, autoload=autoload, benchmark=self.benchmark_name
            )

        logger.info(
            f"Initialized benchmark '{benchmark}' "
            f"with {len(available_tasks)} tasks: \n"
            f"{pprint.pformat(list(available_tasks))}"
        )

    def __getattr__(self, item):
        """
        Enable MatbenchBenchmark.task_name behavior.

        Args:
            item (str): The name of the attr.

        Returns:

            (object): The attr, if not in the metadata defined by the benchmark
                If the attr is a task name, returns that MatBenchTask object.

        """
        if item in self.metadata:
            return self.tasks_map[item]
        else:
            return self.__getattribute__(item)

    @classmethod
    def from_preset(cls, benchmark, preset_name, autoload=False):
        """
        The following presets are defined for each benchmark:

        benchmark: 'matbench_v0.1':

            - preset: 'structure' - Only structure problems
            - preset: 'composition' - Only composition problems
            - preset: 'regression' - Only regression problems
            - preset: 'classification' - Only classification problems
            - preset: 'all' - All problems in matbench v0.1

        Args:
            benchmark (str): Name of the benchmark set you'd like to use. The
                only supported benchmark set currently is "matbench_v0.1"
            preset_name (str): The name of the preset
            autoload (bool): If true, automatically loads all the datasets
                upon instantiation. Be warned; this can take a while.

        Returns:
            (MatbenchBenchmark object): A ready-to-use MatbenchBenchmark
                object.

        """
        if benchmark == MBV01_KEY:
            if preset_name == STRUCTURE_KEY:
                available_tasks = [
                    k
                    for k, v in mbv01_metadata.items()
                    if v.input_type == STRUCTURE_KEY
                ]
            elif preset_name == COMPOSITION_KEY:
                available_tasks = [
                    k
                    for k, v in mbv01_metadata.items()
                    if v.input_type == COMPOSITION_KEY
                ]
            elif preset_name == REG_KEY:
                available_tasks = [
                    k for k, v in mbv01_metadata.items() if v.task_type == REG_KEY
                ]
            elif preset_name == CLF_KEY:
                available_tasks = [
                    k for k, v in mbv01_metadata.items() if v.task_type == CLF_KEY
                ]
            elif preset_name == cls.ALL_KEY:
                available_tasks = [k for k, v in mbv01_metadata.items()]
            else:
                valid_keys = [
                    STRUCTURE_KEY,
                    COMPOSITION_KEY,
                    CLF_KEY,
                    REG_KEY,
                    cls.ALL_KEY,
                ]
                raise ValueError(
                    f"Preset name '{preset_name}' not recognized for "
                    f"benchmark '{MBV01_KEY}'! Select from "
                    f"{valid_keys}"
                )
        else:
            raise ValueError(
                f"Only '{MBV01_KEY}' available. No other benchmarks defined!"
            )

        return cls(benchmark=benchmark, autoload=autoload, subset=available_tasks)

    @classmethod
    def from_dict(cls, d):
        """Create a MatbenchBenchmark object from a dictionary.

        Args:
            d (dict): The benchmark as a dictionary.

        Returns:
            (MatbenchBenchmark): The benchmark as an object.

        """
        required_keys = [
            "@module",
            "@class",
            cls._VERSION_KEY,
            cls._BENCHMARK_KEY,
            cls._TASKS_KEY,
            cls._USER_METADATA_KEY,
            cls._DATESTAMP_KEY,
            cls._HASH_KEY,
        ]

        missing_keys = []
        for k in required_keys:
            if k not in d:
                missing_keys.append(k)

        extra_keys = []
        for k in d:
            if k not in required_keys:
                extra_keys.append(k)

        if missing_keys and not extra_keys:
            raise ValueError(
                f"Required keys {missing_keys} for {cls.__class__.__name__} "
                f"not found!"
            )
        elif not missing_keys and extra_keys:
            raise ValueError(
                f"Extra keys {extra_keys} for {cls.__class__.__name__} " f"present!"
            )
        elif missing_keys and extra_keys:
            raise ValueError(
                f"Missing required keys {missing_keys} and extra keys "
                f"{extra_keys} present!"
            )

        # Check all tasks to make sure their benchmark name is matching in the
        # benchmark and in the tasks
        not_matching_bench = []
        for t_dict in d[cls._TASKS_KEY].values():
            if t_dict[MatbenchTask._BENCHMARK_KEY] != d[cls._BENCHMARK_KEY]:
                not_matching_bench.append(t_dict[MatbenchTask._DATASET_KEY])
        if not_matching_bench:
            raise ValueError(
                f"Tasks {not_matching_bench} do not have a benchmark name "
                f"matching the benchmark ({d[cls._BENCHMARK_KEY]})!"
            )

        # Ensure the hash is matching, i.e., the data was not modified after
        # matbench got done with it
        m_from_dict = d.pop(cls._HASH_KEY)
        m = hash_dictionary(d)
        if m != m_from_dict:
            raise ValueError(
                f"Hash of dictionary does not match it's reported value! {m} "
                f"!= {m_from_dict} . Was the data modified after saving?)"
            )

        # Check to see if any tasks have task names not matching their key
        # names in the benchmark
        not_matching_tasks = []
        for task_name, task_info in d[cls._TASKS_KEY].items():
            key_as_per_task = task_info[MatbenchTask._DATASET_KEY]
            if task_name != key_as_per_task:
                not_matching_tasks.append((task_name, key_as_per_task))
        if not_matching_tasks:
            raise ValueError(
                f"Task names in benchmark and task names in tasks not "
                f"matching: {not_matching_tasks}"
            )

        # Warn if versions are not matching
        if d[cls._VERSION_KEY] != VERSION:
            logger.warning(
                f"Warning! Versions not matching: "
                f"(data file has version {d[cls._VERSION_KEY]}, "
                f"this package is {VERSION})."
            )

        return cls._from_args(
            benchmark_name=d[cls._BENCHMARK_KEY],
            tasks_dict=d[cls._TASKS_KEY],
            user_metadata=d[cls._USER_METADATA_KEY],
        )

    @classmethod
    def _from_args(cls, benchmark_name, tasks_dict, user_metadata):
        """Create a MatbenchBenchmark object from arguments

        Args:
            benchmark_name (str): name of the benchmark
            tasks_dict (dict): formatted dict of task data
            user_metadata (dict): freeform user metadata

        Returns:
            (MatbenchBenchmark)
        """
        subset = list(tasks_dict.keys())
        obj = cls(benchmark=benchmark_name, autoload=False, subset=subset)
        obj.tasks_map = RecursiveDotDict(
            {
                t_name: MatbenchTask.from_dict(t_dict)
                for t_name, t_dict in tasks_dict.items()
            }
        )

        logger.warning(
            "To add new data to this benchmark, the "
            "benchmark must be loaded with .load(). Alternatively, "
            "load individual tasks with MatbenchTask.load()."
        )

        # MatbenchTask automatically validates files during its from_dict
        obj.user_metadata = user_metadata

        logger.debug(f"Successfully converted dict/args to '{cls.__name__}'.")

        return obj

    def _determine_completeness(self, completeness_type):
        """Determine the completeness of this benchmark.

        Completeness means the tasks are included (but not
        necessarily recorded yet) in the benchmark.

        Supported completeness types are:
        - "all": All tasks are included
        - "composition": All composition tasks are included
        - "structure": All structure tasks are included
        - "regression": All regression problems
        - "classification": All classification problems

        Args:
            completeness_type (str): One of the above completeness
                types.

        Returns:
            (bool) True if this benchmark object is complete
                with respect to the completeness type.

        """
        if completeness_type == self.ALL_KEY:
            required_tasks = list(self.metadata.keys())
        elif completeness_type in (COMPOSITION_KEY, STRUCTURE_KEY):
            required_tasks = [
                k
                for k, v in self.metadata.items()
                if v.input_type == completeness_type
            ]
        elif completeness_type in (REG_KEY, CLF_KEY):
            required_tasks = [
                k
                for k, v in self.metadata.items()
                if v.task_type == completeness_type
            ]
        else:
            allowed_completeness_types = [
                self.ALL_KEY,
                COMPOSITION_KEY,
                STRUCTURE_KEY,
                REG_KEY,
                CLF_KEY,
            ]
            raise ValueError(
                "Only supported completeness types are "
                f"{allowed_completeness_types}"
            )

        for task in required_tasks:
            if task not in self.tasks_map:
                return False
        else:
            return True

    def as_dict(self):
        """Overridden from MSONable.as_dict, get dict repr of this obj

        Returns:
            d (dict): the object as a dictionary.

        """
        tasksd = {mbt.dataset_name: mbt.as_dict() for mbt in self.tasks}
        tasksd_jsonable = immutify_dictionary(tasksd)

        d = {
            "@module": self.__class__.__module__,
            "@class": self.__class__.__name__,
            self._VERSION_KEY: VERSION,
            self._TASKS_KEY: tasksd_jsonable,
            self._USER_METADATA_KEY: self.user_metadata,
            self._BENCHMARK_KEY: self.benchmark_name,
            self._DATESTAMP_KEY: datetime.datetime.utcnow().strftime(
                self._DATESTAMP_FMT
            ),
        }

        # to obtain a hash for this benchmark, immutify the dictionary
        # and then stringify it
        d[self._HASH_KEY] = hash_dictionary(d)
        logger.debug(
            f"Successfully converted {self.__class__.__name__} to dictionary."
        )
        return d

    def get_info(self):
        """Log info about the benchmark to the respective logging handlers.

        Returns:
            (NoneType): Output is sent to logger.
        """
        logger.info(self.info)

    def add_metadata(self, metadata):
        """Add freeform information about this run to the object
        (and subsequent json), accessible thru the
        'user_metadata' attr.


        All keys must be strings.

        All values must be either:
            a. a numpy ndarray
            b. python native types, such as bools, floats, ints, strs
            c. a pandas series
            d. a list/tuple of python native types (bools, floats, ints)

            OR

            e. A dictionary where all keys are strs and all values
               are one of a, b, c, d, or e (recursive).

        Args:
            metadata (dict): Metadata about the algorithm being
                run on this benchmark.

        Returns:
            (NoneType): None. Logger provides information.
        """
        # Use logging here so bad metadata addition does not
        # ruin an entire run...
        if not isinstance(metadata, dict):
            logger.critical(
                f"User metadata must be reducible to dict format, "
                f"not type({type(metadata)})"
            )
            logger.info("User metadata not added.")

        else:
            if self.user_metadata:
                logger.warning("User metadata already exists! Overwriting...")

            self.user_metadata = immutify_dictionary(metadata)
            logger.info("User metadata added successfully!")

    def load(self):
        """Load all tasks in this benchmark.
        Returns:
            (NoneType): Datasets are kept in attributes.
        """
        for t in self.tasks:
            t.load()

    def validate(self):
        """Run validation on each task in this benchmark.

        Returns:
            ({str: str}): dict of errors, if they exist

        """
        errors = {}
        for t, t_obj in self.tasks_map.items():
            try:
                t_obj.validate()
            except BaseException:
                errors[t] = traceback.format_exc()
        return errors

    @property
    def tasks(self):
        """Return the tasks as a list.

        Returns:
            ([MatbenchTask]): A list of matbench tasks in this benchmark
        """
        return self.tasks_map.values()

    @property
    def scores(self):
        """Get all score metrics for all tasks as a dictionary.

        Returns:
            (RecursiveDotDict): A nested dictionary-like object of scores
                for each task.

        """
        return RecursiveDotDict({t.dataset_name: t.scores for t in self.tasks})

    @property
    def info(self):
        """Get a formatted string of info about this benchmark and its current
        state.

        Returns:
            s (str): A formatted string describing this benchmark's state.

        """

        complete = self.is_complete
        recorded = self.is_recorded
        valid = self.is_valid

        s = ""
        s += (
            f"\nMatbench package {VERSION} running benchmark "
            f"'{self.benchmark_name}'"
        )
        s += f"\n\tis complete: {complete}"
        s += f"\n\tis recorded: {recorded}"
        s += f"\n\tis valid: {valid}"

        if not recorded:
            s += (
                "\n\n Benchmark is not fully recorded; limited information " "shown."
            )
        if not valid:
            s += "\n\n Benchmark is not valid; limited information shown."

        if not valid or not recorded:
            s += "\n\nTasks:"
            for t in self.tasks_map.values():
                s += f"\n\t- '{t.dataset_name}: recorded={t.all_folds_recorded}"

        if valid and recorded:
            s += "\n\nResults:"
            for t in self.tasks:

                if t.metadata.task_type == REG_KEY:
                    score_text = (
                        f"MAE mean: " f"{self.scores[t.dataset_name].mae.mean}"
                    )
                else:
                    score_text = (
                        f"ROCAUC mean: " f"{self.scores[t.dataset_name].rocauc.mean}"
                    )
                s += f"\n\t- '{t.dataset_name}' {score_text}"

        return s

    @property
    def is_complete(self):
        """Determine if all available tasks are included in this benchmark.

        For matbench v0.1, this means all 13 tasks are in the benchmark.

        Returns:
            (bool): Whether benchmark is entirely complete.

        """
        return self._determine_completeness(completeness_type=self.ALL_KEY)

    @property
    def is_composition_complete(self):
        """Determine if all composition tasks for this benchmark are included

        Returns:
            (bool): Whether benchmark is composition complete.
        """
        return self._determine_completeness(completeness_type=COMPOSITION_KEY)

    @property
    def is_structure_complete(self):
        """Determine if all structure tasks for this benchmark are included

        Returns:
            (bool): Whether benchmark is structure complete.
        """
        return self._determine_completeness(completeness_type=STRUCTURE_KEY)

    @property
    def is_regression_complete(self):
        """Determine if all regression tasks for this benchmark are included

        Returns:
            (bool): Whether benchmark is regression complete.
        """
        return self._determine_completeness(completeness_type=REG_KEY)

    @property
    def is_classification_complete(self):
        """Determine if all classification tasks for this benchmark are included

        Returns:
            (bool): Whether benchmark is classification complete.
        """
        return self._determine_completeness(completeness_type=CLF_KEY)

    @property
    def is_recorded(self):
        """All tasks in this benchmark (whether or not it includes all tasks in
        the benchmark set) are recorded.

        Returns:
            (bool): True if all tasks (even if only a subset of all matbench)
            for this benchmark are recorded.

        """
        return all([t.all_folds_recorded for t in self.tasks_map.values()])

    @property
    def is_valid(self):
        """Checks all tasks are recorded and valid, as per each task's
        validation procedure.

        Can take some time, especially if the tasks are not already
        loaded into memory.

        Returns:
            (bool): True if all tasks are valid
        """
        errors = self.validate()
        if errors:
            formatted_errors = pprint.pformat(errors)
            logger.critical(
                f"Benchmark has errors! " f"Errors:\n {formatted_errors}"
            )
            return False
        else:
            return True

__getattr__(item)

Enable MatbenchBenchmark.task_name behavior.

Parameters:

- item (str): The name of the attr. Required.

Returns:

- (object): If the attr is a task name defined in the benchmark metadata, the corresponding MatbenchTask object; otherwise, the attribute itself.

Source code in matbench/bench.py
def __getattr__(self, item):
    """
    Enable MatbenchBenchmark.task_name behavior.

    Args:
        item (str): The name of the attr.

    Returns:

        (object): The attr, if not in the metadata defined by the benchmark
            If the attr is a task name, returns that MatBenchTask object.

    """
    if item in self.metadata:
        return self.tasks_map[item]
    else:
        return self.__getattribute__(item)
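
A short sketch of the resulting attribute access, using the matbench_dielectric task named elsewhere on this page:

    from matbench.bench import MatbenchBenchmark

    mb = MatbenchBenchmark(subset=["matbench_dielectric"])

    task = mb.matbench_dielectric                        # resolved through __getattr__
    print(task.dataset_name)                             # 'matbench_dielectric'
    print(task is mb.tasks_map["matbench_dielectric"])   # same MatbenchTask object (expected True)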

__init__(benchmark=MBV01_KEY, autoload=False, subset=None)

Parameters:

- benchmark (str): The name of the benchmark. The only supported benchmark currently is "matbench_v0.1", though more will be added in the future. Default: MBV01_KEY.
- autoload (bool): If True, automatically load the datasets into memory. For a full benchmark, this can take some time. If False, you'll need to load each task with .load before you can access the raw data. Default: False.
- subset ([str]): A list of task names to use as a subset of a full benchmark. Only the named tasks will be contained in the class. Must correspond to the metadata file defined by the benchmark name. Default: None.

Source code in matbench/bench.py
def __init__(self, benchmark=MBV01_KEY, autoload=False, subset=None):
    """

    Args:
        benchmark (str): The name of the benchmark. Only supported benchmark
            currently is "matbench_v0.1", though more will be added in the
            future.
        autoload (bool): If True, automatically load the dataset into memory
            For a full benchmark, this can take some time. If False, you'll
            need to load each task with .load before you can access the raw
            data.
        subset ([str]): A list of task names to use as a subset of a full
            benchmark. Only the named tasks will be contained in the class.
            Must correspond to the metadata file defined by the benchmark
            name.
    """

    if benchmark == MBV01_KEY:
        self.benchmark_name = MBV01_KEY
        self.metadata = mbv01_metadata
    else:
        raise ValueError(
            f"Only '{MBV01_KEY}' available. No other benchmarks defined!"
        )

    if subset:
        not_datasets = [k for k in subset if k not in self.metadata]
        if not_datasets:
            raise KeyError(
                f"Some tasks in {subset} are not benchmark="
                f"'{self.benchmark_name}' datasets! Remove {not_datasets}."
            )
        else:
            available_tasks = subset
    else:
        available_tasks = self.metadata.keys()

    self.user_metadata = {}
    self.tasks_map = RecursiveDotDict()

    for ds in available_tasks:
        self.tasks_map[ds] = MatbenchTask(
            ds, autoload=autoload, benchmark=self.benchmark_name
        )

    logger.info(
        f"Initialized benchmark '{benchmark}' "
        f"with {len(available_tasks)} tasks: \n"
        f"{pprint.pformat(list(available_tasks))}"
    )
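
A sketch of constructing a subset benchmark lazily; the only real task name used here, matbench_dielectric, appears elsewhere on this page, and unknown names raise the KeyError shown in the code above.

    from matbench.bench import MatbenchBenchmark

    # Only matbench_dielectric is included; autoload=False defers dataset loading.
    mb = MatbenchBenchmark(benchmark="matbench_v0.1", autoload=False,
                           subset=["matbench_dielectric"])
    print(list(mb.tasks_map.keys()))   # ['matbench_dielectric']

    try:
        MatbenchBenchmark(subset=["not_a_real_task"])  # hypothetical bad task name
    except KeyError as exc:
        print(exc)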

add_metadata(metadata)

Add freeform information about this run to the object (and subsequent json), accessible thru the 'user_metadata' attr.

All keys must be strings.

All values must be either:

a. a numpy ndarray
b. python native types, such as bools, floats, ints, strs
c. a pandas series
d. a list/tuple of python native types (bools, floats, ints)

OR

e. A dictionary where all keys are strs and all values are one of a, b, c, d, or e (recursive).

Parameters:

- metadata (dict): Metadata about the algorithm being run on this benchmark. Required.

Returns:

- (NoneType): None. Logger provides information.

Source code in matbench/bench.py
def add_metadata(self, metadata):
    """Add freeform information about this run to the object
    (and subsequent json), accessible thru the
    'user_metadata' attr.


    All keys must be strings.

    All values must be either:
        a. a numpy ndarray
        b. python native types, such as bools, floats, ints, strs
        c. a pandas series
        d. a list/tuple of python native types (bools, floats, ints)

        OR

        e. A dictionary where all keys are strs and all values
           are one of a, b, c, d, or e (recursive).

    Args:
        metadata (dict): Metadata about the algorithm being
            run on this benchmark.

    Returns:
        (NoneType): None. Logger provides information.
    """
    # Use logging here so bad metadata addition does not
    # ruin an entire run...
    if not isinstance(metadata, dict):
        logger.critical(
            f"User metadata must be reducible to dict format, "
            f"not type({type(metadata)})"
        )
        logger.info("User metadata not added.")

    else:
        if self.user_metadata:
            logger.warning("User metadata already exists! Overwriting...")

        self.user_metadata = immutify_dictionary(metadata)
        logger.info("User metadata added successfully!")

as_dict()

Overridden from MSONable.as_dict, get dict repr of this obj

Returns:

- d (dict): The object as a dictionary.

Source code in matbench/bench.py
def as_dict(self):
    """Overridden from MSONable.as_dict, get dict repr of this obj

    Returns:
        d (dict): the object as a dictionary.

    """
    tasksd = {mbt.dataset_name: mbt.as_dict() for mbt in self.tasks}
    tasksd_jsonable = immutify_dictionary(tasksd)

    d = {
        "@module": self.__class__.__module__,
        "@class": self.__class__.__name__,
        self._VERSION_KEY: VERSION,
        self._TASKS_KEY: tasksd_jsonable,
        self._USER_METADATA_KEY: self.user_metadata,
        self._BENCHMARK_KEY: self.benchmark_name,
        self._DATESTAMP_KEY: datetime.datetime.utcnow().strftime(
            self._DATESTAMP_FMT
        ),
    }

    # to obtain a hash for this benchmark, immutify the dictionary
    # and then stringify it
    d[self._HASH_KEY] = hash_dictionary(d)
    logger.debug(
        f"Successfully converted {self.__class__.__name__} to dictionary."
    )
    return d
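
A sketch of the dictionary representation, assuming `mb` is a recorded MatbenchBenchmark (e.g., produced by the workflow sketch near the top of this page); dumping the immutified dict with the standard json module is an assumption, not something documented here.

    import json

    # Assumes `mb` is a recorded MatbenchBenchmark.
    d = mb.as_dict()
    print(d["benchmark_name"], d["version"], d["datestamp"])
    print(d["hash"])   # integrity hash computed over the rest of the dictionary

    with open("my_results.json", "w") as f:   # hypothetical output path
        json.dump(d, f)                       # assumption: the immutified dict is JSON-serializable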

from_dict(d) classmethod

Create a MatbenchBenchmark object from a dictionary.

Parameters:

- d (dict): The benchmark as a dictionary. Required.

Returns:

- (MatbenchBenchmark): The benchmark as an object.

Source code in matbench/bench.py
@classmethod
def from_dict(cls, d):
    """Create a MatbenchBenchmark object from a dictionary.

    Args:
        d (dict): The benchmark as a dictionary.

    Returns:
        (MatbenchBenchmark): The benchmark as an object.

    """
    required_keys = [
        "@module",
        "@class",
        cls._VERSION_KEY,
        cls._BENCHMARK_KEY,
        cls._TASKS_KEY,
        cls._USER_METADATA_KEY,
        cls._DATESTAMP_KEY,
        cls._HASH_KEY,
    ]

    missing_keys = []
    for k in required_keys:
        if k not in d:
            missing_keys.append(k)

    extra_keys = []
    for k in d:
        if k not in required_keys:
            extra_keys.append(k)

    if missing_keys and not extra_keys:
        raise ValueError(
            f"Required keys {missing_keys} for {cls.__class__.__name__} "
            f"not found!"
        )
    elif not missing_keys and extra_keys:
        raise ValueError(
            f"Extra keys {extra_keys} for {cls.__class__.__name__} " f"present!"
        )
    elif missing_keys and extra_keys:
        raise ValueError(
            f"Missing required keys {missing_keys} and extra keys "
            f"{extra_keys} present!"
        )

    # Check all tasks to make sure their benchmark name is matching in the
    # benchmark and in the tasks
    not_matching_bench = []
    for t_dict in d[cls._TASKS_KEY].values():
        if t_dict[MatbenchTask._BENCHMARK_KEY] != d[cls._BENCHMARK_KEY]:
            not_matching_bench.append(t_dict[MatbenchTask._DATASET_KEY])
    if not_matching_bench:
        raise ValueError(
            f"Tasks {not_matching_bench} do not have a benchmark name "
            f"matching the benchmark ({d[cls._BENCHMARK_KEY]})!"
        )

    # Ensure the hash is matching, i.e., the data was not modified after
    # matbench got done with it
    m_from_dict = d.pop(cls._HASH_KEY)
    m = hash_dictionary(d)
    if m != m_from_dict:
        raise ValueError(
            f"Hash of dictionary does not match it's reported value! {m} "
            f"!= {m_from_dict} . Was the data modified after saving?)"
        )

    # Check to see if any tasks have task names not matching their key
    # names in the benchmark
    not_matching_tasks = []
    for task_name, task_info in d[cls._TASKS_KEY].items():
        key_as_per_task = task_info[MatbenchTask._DATASET_KEY]
        if task_name != key_as_per_task:
            not_matching_tasks.append((task_name, key_as_per_task))
    if not_matching_tasks:
        raise ValueError(
            f"Task names in benchmark and task names in tasks not "
            f"matching: {not_matching_tasks}"
        )

    # Warn if versions are not matching
    if d[cls._VERSION_KEY] != VERSION:
        logger.warning(
            f"Warning! Versions not matching: "
            f"(data file has version {d[cls._VERSION_KEY]}, "
            f"this package is {VERSION})."
        )

    return cls._from_args(
        benchmark_name=d[cls._BENCHMARK_KEY],
        tasks_dict=d[cls._TASKS_KEY],
        user_metadata=d[cls._USER_METADATA_KEY],
    )
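
A sketch of restoring a benchmark from a saved dictionary; the path is hypothetical, and the dict is assumed to have been produced by as_dict, so it carries the hash, datestamp, and version keys checked above.

    import json

    from matbench.bench import MatbenchBenchmark

    with open("my_results.json") as f:    # hypothetical path to a saved benchmark dict
        d = json.load(f)

    mb = MatbenchBenchmark.from_dict(d)   # checks keys, hash integrity, and task/benchmark names
    print(mb.benchmark_name, list(mb.tasks_map.keys()))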

from_preset(benchmark, preset_name, autoload=False) classmethod

The following presets are defined for each benchmark:

benchmark: 'matbench_v0.1':

- preset: 'structure' - Only structure problems
- preset: 'composition' - Only composition problems
- preset: 'regression' - Only regression problems
- preset: 'classification' - Only classification problems
- preset: 'all' - All problems in matbench v0.1

Parameters:

- benchmark (str): Name of the benchmark set you'd like to use. The only supported benchmark set currently is "matbench_v0.1". Required.
- preset_name (str): The name of the preset. Required.
- autoload (bool): If true, automatically loads all the datasets upon instantiation. Be warned; this can take a while. Default: False.

Returns:

- (MatbenchBenchmark): A ready-to-use MatbenchBenchmark object.

Source code in matbench/bench.py
@classmethod
def from_preset(cls, benchmark, preset_name, autoload=False):
    """
    The following presets are defined for each benchmark:

    benchmark: 'matbench_v0.1':

        - preset: 'structure' - Only structure problems
        - preset: 'composition' - Only composition problems
        - preset: 'regression' - Only regression problems
        - preset: 'classification' - Only classification problems
        - preset: 'all' - All problems in matbench v0.1

    Args:
        benchmark (str): Name of the benchmark set you'd like to use. The
            only supported benchmark set currently is "matbench_v0.1"
        preset_name (str): The name of the preset
        autoload (bool): If true, automatically loads all the datasets
            upon instantiation. Be warned; this can take a while.

    Returns:
        (MatbenchBenchmark object): A ready-to-use MatbenchBenchmark
            object.

    """
    if benchmark == MBV01_KEY:
        if preset_name == STRUCTURE_KEY:
            available_tasks = [
                k
                for k, v in mbv01_metadata.items()
                if v.input_type == STRUCTURE_KEY
            ]
        elif preset_name == COMPOSITION_KEY:
            available_tasks = [
                k
                for k, v in mbv01_metadata.items()
                if v.input_type == COMPOSITION_KEY
            ]
        elif preset_name == REG_KEY:
            available_tasks = [
                k for k, v in mbv01_metadata.items() if v.task_type == REG_KEY
            ]
        elif preset_name == CLF_KEY:
            available_tasks = [
                k for k, v in mbv01_metadata.items() if v.task_type == CLF_KEY
            ]
        elif preset_name == cls.ALL_KEY:
            available_tasks = [k for k, v in mbv01_metadata.items()]
        else:
            valid_keys = [
                STRUCTURE_KEY,
                COMPOSITION_KEY,
                CLF_KEY,
                REG_KEY,
                cls.ALL_KEY,
            ]
            raise ValueError(
                f"Preset name '{preset_name}' not recognized for "
                f"benchmark '{MBV01_KEY}'! Select from "
                f"{valid_keys}"
            )
    else:
        raise ValueError(
            f"Only '{MBV01_KEY}' available. No other benchmarks defined!"
        )

    return cls(benchmark=benchmark, autoload=autoload, subset=available_tasks)
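
A sketch of preset construction for the structure-only subset of matbench_v0.1:

    from matbench.bench import MatbenchBenchmark

    # Only the structure-input problems of matbench v0.1; data is not loaded yet.
    mb = MatbenchBenchmark.from_preset("matbench_v0.1", "structure", autoload=False)
    print(list(mb.tasks_map.keys()))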

get_info()

Log info about the benchmark to the respective logging handlers.

Returns:

- (NoneType): Output is sent to logger.

Source code in matbench/bench.py
def get_info(self):
    """Log info about the benchmark to the respective logging handlers.

    Returns:
        (NoneType): Output is sent to logger.
    """
    logger.info(self.info)

info() property

Get a formatted string of info about this benchmark and its current state.

Returns:

- s (str): A formatted string describing this benchmark's state.

Source code in matbench/bench.py
@property
def info(self):
    """Get a formatted string of info about this benchmark and its current
    state.

    Returns:
        s (str): A formatted string describing this benchmark's state.

    """

    complete = self.is_complete
    recorded = self.is_recorded
    valid = self.is_valid

    s = ""
    s += (
        f"\nMatbench package {VERSION} running benchmark "
        f"'{self.benchmark_name}'"
    )
    s += f"\n\tis complete: {complete}"
    s += f"\n\tis recorded: {recorded}"
    s += f"\n\tis valid: {valid}"

    if not recorded:
        s += (
            "\n\n Benchmark is not fully recorded; limited information " "shown."
        )
    if not valid:
        s += "\n\n Benchmark is not valid; limited information shown."

    if not valid or not recorded:
        s += "\n\nTasks:"
        for t in self.tasks_map.values():
            s += f"\n\t- '{t.dataset_name}: recorded={t.all_folds_recorded}"

    if valid and recorded:
        s += "\n\nResults:"
        for t in self.tasks:

            if t.metadata.task_type == REG_KEY:
                score_text = (
                    f"MAE mean: " f"{self.scores[t.dataset_name].mae.mean}"
                )
            else:
                score_text = (
                    f"ROCAUC mean: " f"{self.scores[t.dataset_name].rocauc.mean}"
                )
            s += f"\n\t- '{t.dataset_name}' {score_text}"

    return s
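
A sketch of inspecting the state string; get_info routes the same text to the logging handlers instead of returning it.

    from matbench.bench import MatbenchBenchmark

    mb = MatbenchBenchmark(subset=["matbench_dielectric"])
    print(mb.info)   # completeness / recorded / validity summary; scores appear once recorded and valid
    mb.get_info()    # same text, sent to the logger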

is_classification_complete() property

Determine if all classification tasks for this benchmark are included

Returns:

- (bool): Whether benchmark is classification complete.

Source code in matbench/bench.py
@property
def is_classification_complete(self):
    """Determine if all classification tasks for this benchmark are included

    Returns:
        (bool): Whether benchmark is classification complete.
    """
    return self._determine_completeness(completeness_type=CLF_KEY)

is_complete() property

Determine if all available tasks are included in this benchmark.

For matbench v0.1, this means all 13 tasks are in the benchmark.

Returns:

- (bool): Whether benchmark is entirely complete.

Source code in matbench/bench.py
@property
def is_complete(self):
    """Determine if all available tasks are included in this benchmark.

    For matbench v0.1, this means all 13 tasks are in the benchmark.

    Returns:
        (bool): Whether benchmark is entirely complete.

    """
    return self._determine_completeness(completeness_type=self.ALL_KEY)

is_composition_complete() property

Determine if all composition tasks for this benchmark are included

Returns:

- (bool): Whether benchmark is composition complete.

Source code in matbench/bench.py
@property
def is_composition_complete(self):
    """Determine if all composition tasks for this benchmark are included

    Returns:
        (bool): Whether benchmark is composition complete.
    """
    return self._determine_completeness(completeness_type=COMPOSITION_KEY)

is_recorded() property

All tasks in this benchmark (whether or not it includes all tasks in the benchmark set) are recorded.

Returns:

- (bool): True if all tasks (even if only a subset of all matbench) for this benchmark are recorded.

Source code in matbench/bench.py
@property
def is_recorded(self):
    """All tasks in this benchmark (whether or not it includes all tasks in
    the benchmark set) are recorded.

    Returns:
        (bool): True if all tasks (even if only a subset of all matbench)
        for this benchmark are recorded.

    """
    return all([t.all_folds_recorded for t in self.tasks_map.values()])

is_regression_complete() property

Determine if all regression tasks for this benchmark are included

Returns:

- (bool): Whether benchmark is regression complete.

Source code in matbench/bench.py
@property
def is_regression_complete(self):
    """Determine if all regression tasks for this benchmark are included

    Returns:
        (bool): Whether benchmark is regression complete.
    """
    return self._determine_completeness(completeness_type=REG_KEY)

is_structure_complete() property

Determine if all structure tasks for this benchmark are included

Returns:

- (bool): Whether benchmark is structure complete.

Source code in matbench/bench.py
@property
def is_structure_complete(self):
    """Determine if all structure tasks for this benchmark are included

    Returns:
        (bool): Whether benchmark is structure complete.
    """
    return self._determine_completeness(completeness_type=STRUCTURE_KEY)

is_valid() property

Checks all tasks are recorded and valid, as per each task's validation procedure.

Can take some time, especially if the tasks are not already loaded into memory.

Returns:

- (bool): True if all tasks are valid.

Source code in matbench/bench.py
@property
def is_valid(self):
    """Checks all tasks are recorded and valid, as per each task's
    validation procedure.

    Can take some time, especially if the tasks are not already
    loaded into memory.

    Returns:
        (bool): True if all tasks are valid
    """
    errors = self.validate()
    if errors:
        formatted_errors = pprint.pformat(errors)
        logger.critical(
            f"Benchmark has errors! " f"Errors:\n {formatted_errors}"
        )
        return False
    else:
        return True

load()

Load all tasks in this benchmark.

Returns:

- (NoneType): Datasets are kept in attributes.

Source code in matbench/bench.py
def load(self):
    """Load all tasks in this benchmark.
    Returns:
        (NoneType): Datasets are kept in attributes.
    """
    for t in self.tasks:
        t.load()
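
A sketch of deferred loading, equivalent to constructing with autoload=True but done explicitly after instantiation:

    from matbench.bench import MatbenchBenchmark

    mb = MatbenchBenchmark(autoload=False, subset=["matbench_dielectric"])
    mb.load()   # loads the dataset for every task in this benchmark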

scores() property

Get all score metrics for all tasks as a dictionary.

Returns:

- (RecursiveDotDict): A nested dictionary-like object of scores for each task.

Source code in matbench/bench.py
@property
def scores(self):
    """Get all score metrics for all tasks as a dictionary.

    Returns:
        (RecursiveDotDict): A nested dictionary-like object of scores
            for each task.

    """
    return RecursiveDotDict({t.dataset_name: t.scores for t in self.tasks})
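
A sketch of reading scores after a recorded run; the mae.mean access mirrors the usage in the info property above, and `mb` is assumed to be a recorded benchmark that includes matbench_dielectric (a regression task).

    # Assumes `mb` is a recorded MatbenchBenchmark containing matbench_dielectric.
    scores = mb.scores
    print(scores["matbench_dielectric"].mae.mean)   # mean MAE across folds, as used by the info property
    # Classification tasks expose rocauc instead, e.g. scores["<classification task>"].rocauc.mean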

tasks() property

Return the tasks as a list.

Returns:

- ([MatbenchTask]): A list of matbench tasks in this benchmark.

Source code in matbench/bench.py
@property
def tasks(self):
    """Return the tasks as a list.

    Returns:
        ([MatbenchTask]): A list of matbench tasks in this benchmark
    """
    return self.tasks_map.values()

validate()

Run validation on each task in this benchmark.

Returns:

- ({str: str}): A dict of errors (task name to traceback string), if any exist.

Source code in matbench/bench.py
def validate(self):
    """Run validation on each task in this benchmark.

    Returns:
        ({str: str}): dict of errors, if they exist

    """
    errors = {}
    for t, t_obj in self.tasks_map.items():
        try:
            t_obj.validate()
        except BaseException:
            errors[t] = traceback.format_exc()
    return errors
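
A sketch of explicit validation; for an unrecorded or invalid benchmark, the returned dict maps each failing task name to a traceback string.

    from matbench.bench import MatbenchBenchmark

    mb = MatbenchBenchmark(subset=["matbench_dielectric"])
    errors = mb.validate()
    for task_name, tb in errors.items():
        print(f"{task_name} failed validation:\n{tb}")
    if not errors:
        print("All tasks validated.")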