oximachine_featurizer API documentation

The featurization module

Featurization functions for the oxidation state mining project. A wrapper around matminer.

class oximachine_featurizer.featurize.FeatureCollector(inpath=None, labelpath=None, outdir_labels='data/labels', outdir_features='data/features', outdir_helper='data/helper', percentage_holdout=0, outdir_holdout=None, forbidden_picklepath=None, exclude_dir=None, selected_features=['local_property_stats', 'column', 'row', 'valenceelectrons', 'diffto18electrons', 'sunfilled', 'punfilled', 'dunfilled', 'crystal_nn_fingerprint'], old_format=False, training_set_size=None, racsfile=None, selectedracs=['D_mc-I-0-all', 'D_mc-I-1-all', 'D_mc-I-2-all', 'D_mc-I-3-all', 'D_mc-S-0-all', 'D_mc-S-1-all', 'D_mc-S-2-all', 'D_mc-S-3-all', 'D_mc-T-0-all', 'D_mc-T-1-all', 'D_mc-T-2-all', 'D_mc-T-3-all', 'D_mc-Z-0-all', 'D_mc-Z-1-all', 'D_mc-Z-2-all', 'D_mc-Z-3-all', 'D_mc-chi-0-all', 'D_mc-chi-1-all', 'D_mc-chi-2-all', 'D_mc-chi-3-all', 'mc-I-0-all', 'mc-I-1-all', 'mc-I-2-all', 'mc-I-3-all', 'mc-S-0-all', 'mc-S-1-all', 'mc-S-2-all', 'mc-S-3-all', 'mc-T-0-all', 'mc-T-1-all', 'mc-T-2-all', 'mc-T-3-all', 'mc-Z-0-all', 'mc-Z-1-all', 'mc-Z-2-all', 'mc-Z-3-all', 'mc-chi-0-all', 'mc-chi-1-all', 'mc-chi-2-all', 'mc-chi-3-all'], drop_duplicates=True)[source]

Bases: object

Converts features from a folder of pickle files into three pickle files: the feature matrix, the label vector and the list of names.

__init__(inpath=None, labelpath=None, outdir_labels='data/labels', outdir_features='data/features', outdir_helper='data/helper', percentage_holdout=0, outdir_holdout=None, forbidden_picklepath=None, exclude_dir=None, selected_features=['local_property_stats', 'column', 'row', 'valenceelectrons', 'diffto18electrons', 'sunfilled', 'punfilled', 'dunfilled', 'crystal_nn_fingerprint'], old_format=False, training_set_size=None, racsfile=None, selectedracs=['D_mc-I-0-all', 'D_mc-I-1-all', 'D_mc-I-2-all', 'D_mc-I-3-all', 'D_mc-S-0-all', 'D_mc-S-1-all', 'D_mc-S-2-all', 'D_mc-S-3-all', 'D_mc-T-0-all', 'D_mc-T-1-all', 'D_mc-T-2-all', 'D_mc-T-3-all', 'D_mc-Z-0-all', 'D_mc-Z-1-all', 'D_mc-Z-2-all', 'D_mc-Z-3-all', 'D_mc-chi-0-all', 'D_mc-chi-1-all', 'D_mc-chi-2-all', 'D_mc-chi-3-all', 'mc-I-0-all', 'mc-I-1-all', 'mc-I-2-all', 'mc-I-3-all', 'mc-S-0-all', 'mc-S-1-all', 'mc-S-2-all', 'mc-S-3-all', 'mc-T-0-all', 'mc-T-1-all', 'mc-T-2-all', 'mc-T-3-all', 'mc-Z-0-all', 'mc-Z-1-all', 'mc-Z-2-all', 'mc-Z-3-all', 'mc-chi-0-all', 'mc-chi-1-all', 'mc-chi-2-all', 'mc-chi-3-all'], drop_duplicates=True)[source]

Initializes a feature collector.

WARNING! The fingerprint selection function assumes that the full feature vector in the pickle files has its columns in the order specified in FEATURE_LABELS_ALL.

Keyword Arguments
  • inpath (Union[str, Path]) – (default: None)

  • labelpath (Union[str, Path]) – (default: None)

  • outdir_labels (Union[str, Path]) – (default: “data/labels”)

  • outdir_features (Union[str, Path]) – (default: “data/features”)

  • outdir_helper (Union[str, Path]) – path to output directory for helper files (feature names, structure names) (default: “data/helper”)

  • percentage_holdout (float) – (default: 0)

  • outdir_holdout (Union[str, Path]) – directory into which the files for the holdout set are written (names, X and y) (default: None)

  • forbidden_picklepath (Union[str, Path]) – (default: None)

  • exclude_dir (Union[str, Path]) – (default: None)

  • selected_features (List[str]) – (default: [“local_property_stats”, “column”, “row”, “valenceelectrons”, “diffto18electrons”, “sunfilled”, “punfilled”, “dunfilled”, “crystal_nn_fingerprint”])

  • old_format (bool) – (default: False)

  • training_set_size (int) – (default: None)

  • racsfile (str) – path to file with RACs (pd.DataFrame saved as CSV) (default: None)

  • selectedracs (List[str]) – (default: the RAC names listed in the signature above)

__weakref__

list of weak references to the object (if defined)

static create_dict_for_feature_table(picklefile)[source]

Reads in a pickle with features and returns a list of dictionaries with one dictionary per metal site.

Parameters

picklefile (Union[str, Path]) –

Return type

List[dict]

Returns

List[dict] – list of dictionaries

static create_dict_for_feature_table_from_dict(d)[source]

Takes a dictionary with features and returns a list of dictionaries with one dictionary per metal site.

Parameters

d (dict) –

Return type

List[dict]

Returns

List[dict] – list of dictionaries

static create_feature_list(picklefiles, forbidden_list, old_format=True)[source]

Reads a list of pickle files and returns their parsed contents as a list

Parameters
  • picklefiles (List[Union[str, Path]]) –

  • forbidden_list (list) – list of “forbidden” names (CSD naming convention) that will not be used

  • old_format (bool) – “legacy” format. Default: True

Return type

list

Returns

list – parsed pickle contents
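
The filtering described for create_feature_list can be illustrated with a minimal sketch. It is not the package's implementation: the helper name read_pickles is hypothetical, the old_format switch is left out, and it is an assumption that a structure's name equals the file name without its suffix.

```python
import pickle
from pathlib import Path
from typing import Iterable, List, Union


def read_pickles(picklefiles: Iterable[Union[str, Path]], forbidden_list: list) -> list:
    """Load every pickle whose structure name is not in ``forbidden_list``."""
    forbidden = set(forbidden_list)
    contents: List[object] = []
    for picklefile in picklefiles:
        path = Path(picklefile)
        # Assumption: the structure name is the file name without its suffix.
        if path.stem in forbidden:
            continue
        with path.open("rb") as handle:
            contents.append(pickle.load(handle))
    return contents
```

Turning the forbidden names into a set keeps the membership test cheap when the list of excluded structures is long.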

dump_featurecollection()[source]

Collects features and writes features, labels and names to separate files

Return type

None
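
As a rough sketch of what writing features, labels and names to separate files could look like: the function name dump_collection, the file names and the single output directory are all assumptions made for illustration. The class itself takes separate outdir_features, outdir_labels and outdir_helper arguments.

```python
import pickle
from pathlib import Path
from typing import List, Sequence


def dump_collection(
    features: Sequence[Sequence[float]],
    labels: Sequence[int],
    names: List[str],
    outdir: Path,
) -> List[Path]:
    """Write the three collections to three pickle files and return their paths."""
    outdir.mkdir(parents=True, exist_ok=True)
    written = []
    # Hypothetical file names; the real output names are not given in this documentation.
    for stem, payload in (("features", features), ("labels", labels), ("names", names)):
        path = outdir / f"{stem}.pkl"
        with path.open("wb") as handle:
            pickle.dump(payload, handle)
        written.append(path)
    return written
```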

static make_labels_table(raw_labels)[source]

Read raw labeling output into a dictionary format that can be used to construct pd.DataFrames

Warning: assumes that each metal in the structure has the same oxidation state, because only the first list element is used. Cases in which this does not hold need to be filtered out earlier.

Parameters

raw_labels (Dict[str, dict]) – nested dictionary of the form {name: {metal: [oxidationstates]}}

Return type

List[dict]

Returns

List[dict] – list of dictionaries of the form [{‘name’:, ‘metal’:, ‘oxidationstate’:}]
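
The transformation performed by make_labels_table can be pictured with the following sketch. It assumes the raw labels are nested as {name: {metal: [oxidationstates]}}. The function name labels_to_rows and the example identifiers are hypothetical.

```python
from typing import Dict, List


def labels_to_rows(raw_labels: Dict[str, dict]) -> List[dict]:
    """Flatten {name: {metal: [oxidationstates]}} into one row per metal."""
    rows = []
    for name, metals in raw_labels.items():
        for metal, oxidation_states in metals.items():
            # Only the first list element is kept, which is why structures with
            # mixed oxidation states must be filtered out beforehand.
            rows.append(
                {"name": name, "metal": metal, "oxidationstate": oxidation_states[0]}
            )
    return rows


rows = labels_to_rows({"ABAVIJ": {"Cu": [2, 2]}, "ACODAA": {"Fe": [3], "Zn": [2]}})
```

A list of flat dictionaries like this can be passed directly to pd.DataFrame, which is the use case the docstring mentions.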

class oximachine_featurizer.featurize.GetFeatures(structure, outpath)[source]

Bases: object

Featurizer

__init__(structure, outpath)[source]

Generates features for a structure

Parameters
  • structure (Structure) – Pymatgen Structure object

  • outpath (Union[str, Path]) – path to which the features will be dumped

__weakref__

list of weak references to the object (if defined)

property cutoff

Chooses a cutoff for the given structure

property featurizer

Return the featurizer (with the suitable cutoff)

classmethod from_file(structurepath, outpath)[source]

Constructs a featurizer class from a path to a structure file and an output path

Parameters
  • structurepath (Union[str, Path]) – Path to structure file

  • outpath (Union[str, Path]) – Path to which the outputs should be written.

Returns

Instance of the GetFeatures class

Return type

object

classmethod from_string(structurestring, outpath)[source]

Constructor for the webapp, using a string of a structure file, e.g., a CIF

Parameters
  • structurestring (str) – File content of a CIF as a string

  • outpath (Union[str, Path]) – Path to which the output should be written.

Raises

ValueError – In case the CIF could not be parsed

Returns

Instance of GetFeatures

Return type

object

return_features()[source]

Runs featurization and returns a list of dictionaries

Returns

List of dictionaries of the form {“metal”:, “feature”:, “coords”:}, i.e., one dictionary with the features for each metal site

Return type

List[dict]

oximachine_featurizer.featurize.featurize(structure, featureset=['local_property_stats', 'column', 'row', 'valenceelectrons', 'diffto18electrons', 'sunfilled', 'punfilled', 'dunfilled', 'crystal_nn_no_steinhardt'])[source]

Finds metals in the structure, featurizes the metal sites and collects the features

Parameters
  • structure (pymatgen.Structure) – Structure to featurize

  • featureset (List[str]) – Features to be used in the final output

Returns

The collected features for the metal sites as an array, together with two lists (see the return type)

Return type

Union[np.array, list, list]

oximachine_featurizer.featurize.get_feature_names(selected_features, offset=0)[source]

Given a set of selected feature categories, return all feature names

Parameters
  • selected_features (List[str]) – feature categories

  • offset (int, optional) – To offset the feature ranges, to be used with RACs. Defaults to 0.

Returns

list of feature names

Return type

List[str]
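
One way to picture the role of offset is the sketch below. It assumes that every feature category owns a column range in the full label list and that RAC columns are prepended to that list, so all ranges shift by the number of RACs. The tables EXAMPLE_RANGES and EXAMPLE_LABELS and the function names_for are hypothetical stand-ins, not the package's actual data.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: half-open column ranges, defined without any RAC columns.
EXAMPLE_RANGES: Dict[str, Tuple[int, int]] = {
    "column": (0, 1),
    "row": (1, 2),
    "crystal_nn_fingerprint": (2, 5),
}

# Hypothetical label list with two RAC columns in front of the other features.
EXAMPLE_LABELS: List[str] = ["rac_a", "rac_b", "column", "row", "cn_1", "cn_2", "cn_3"]


def names_for(selected_features: List[str], offset: int = 0) -> List[str]:
    """Collect the labels of all selected categories, shifting each range by ``offset``."""
    names: List[str] = []
    for category in selected_features:
        start, end = EXAMPLE_RANGES[category]
        names.extend(EXAMPLE_LABELS[start + offset:end + offset])
    return names


selected = names_for(["row", "crystal_nn_fingerprint"], offset=2)
```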

The parsing module

Parsing functions for the oxidation state mining project

class oximachine_featurizer.parse.GetOxStatesCSD(cds_ids)[source]

Bases: object

Main parsing class

__init__(cds_ids)[source]

Parses CSD structures for oxidation states

Parameters

cds_ids (List[str]) – list of CSD database identifiers

Returns

None

__weakref__

list of weak references to the object (if defined)

parse_csd_entry(database_id)[source]

Looks up a CSD id and runs the parsing

Parameters

database_id (str) – CSD database identifier

Returns

symbol - oxidation state dictionary

Return type

dict

Returns an empty dict upon exception (for example, if it cannot find the structure in the database).

parse_name(chemical_name_string)[source]

Takes the chemical name string from the CSD database and returns, if it finds it, a dictionary with the oxidation states for the metals

Parameters

chemical_name_string (str) – full chemical name

Returns

dictionary of symbol: oxidation states (list)

Return type

dict
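
To make the idea behind parse_name concrete, here is a minimal sketch that assumes oxidation states appear as Roman numerals in parentheses directly after the metal name, as in “copper(ii)”. The regular expression, the truncated lookup tables and the function name oxidation_states_from_name are illustrative assumptions and may differ from what the package does.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Hypothetical lookup tables, truncated for brevity.
SYMBOLS = {"copper": "Cu", "iron": "Fe", "zinc": "Zn"}
ROMAN = {"0": 0, "i": 1, "ii": 2, "iii": 3, "iv": 4, "v": 5, "vi": 6, "vii": 7, "viii": 8}

# A word followed by "(0)" or a parenthesized run of the letters i and v.
PATTERN = re.compile(r"([a-z]+)\((0|[iv]+)\)", re.IGNORECASE)


def oxidation_states_from_name(chemical_name_string: str) -> Dict[str, List[int]]:
    """Return {symbol: [oxidation states]} for every known metal in the name."""
    states = defaultdict(list)
    for element, numeral in PATTERN.findall(chemical_name_string):
        symbol = SYMBOLS.get(element.lower())
        value = ROMAN.get(numeral.lower())
        # Skip words that are not known metals and numerals that are not valid.
        if symbol is not None and value is not None:
            states[symbol].append(value)
    return dict(states)
```

Names without a recognizable metal and numeral yield an empty dictionary in this sketch.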

run_parsing(njobs=4)[source]

Runs (concurrent) parsing over the list of database identifiers.

Parameters

njobs (int) – maximum number of parallel workers

Returns

nested dictionary of the form {‘id’: {‘symbol’: [oxidation states]}}

Return type

Dict[str, dict]
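
The concurrent pattern described for run_parsing might look like the sketch below. It uses a thread pool from the standard library as an assumption, since this documentation does not state whether threads or processes are used. The function parse_all and the parse_one callback are hypothetical stand-ins for the class methods.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def parse_all(
    ids: List[str], parse_one: Callable[[str], dict], njobs: int = 4
) -> Dict[str, dict]:
    """Apply ``parse_one`` to every identifier with at most ``njobs`` workers."""
    with ThreadPoolExecutor(max_workers=njobs) as executor:
        # executor.map preserves the input order, so ids and results line up.
        results = list(executor.map(parse_one, ids))
    return dict(zip(ids, results))
```

A callback that returns an empty dict for unknown identifiers, as parse_csd_entry is documented to do, keeps a single failed lookup from aborting the whole run.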