Multiple missing values in BayesianNetwork.predict throws IndexError #1817

vsocrates · 2024-09-20T21:08:10Z

Subject of the issue

When the DataFrame passed into the predict function of BayesianNetwork contains multiple missing values, predict throws an IndexError.

Your environment

pgmpy version: dev
Python version: 3.10.14
Operating System: Red Hat Enterprise Linux 8.8 (Ootpa)

Steps to reproduce

Using the documentation example:

import numpy as np
import pandas as pd
from pgmpy.models import BayesianNetwork

values = pd.DataFrame(np.random.randint(low=0, high=2, size=(1000, 5)),
                      columns=['A', 'B', 'C', 'D', 'E'])

train_data = values[:800]
predict_data = values[800:]
model = BayesianNetwork([('A', 'B'), ('C', 'B'), ('C', 'D'), ('B', 'E')])
model.fit(train_data)
predict_data = predict_data.copy()
predict_data.drop('E', axis=1, inplace=True)

# randomly add some missing values (NaN), keeping at least one observed value per row
mask = np.random.choice([True, False], size=predict_data.shape)
mask[mask.all(1),-1] = 0
predict_data = predict_data.mask(mask)

# predict throws error
y_pred = model.predict(predict_data)

Expected behaviour

Shouldn't throw an error.

Actual behaviour

Throws an IndexError when it gets down to DiscreteFactor:

WARNING:pgmpy:Found unknown state name. Trying to switch to using all state names as state numbers

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[1], line 21
     18 predict_data = predict_data.mask(mask)
     20 # predict throws error
---> 21 y_pred = model.predict(predict_data)

File ~/.conda/envs/bnlearn/lib/python3.10/site-packages/pgmpy/models/BayesianNetwork.py:730, in BayesianNetwork.predict(self, data, stochastic, n_jobs)
    727 pred_values = []
    729 # Send state_names dict from one of the estimated CPDs to the inference class.
--> 730 pred_values = Parallel(n_jobs=n_jobs)(
    731     delayed(model_inference.map_query)(
    732         variables=missing_variables,
    733         evidence=data_point.to_dict(),
    734         show_progress=False,
    735     )
    736     for index, data_point in tqdm(
    737         data_unique.iterrows(), total=data_unique.shape[0]
    738     )
    739 )
    741 df_results = pd.DataFrame(pred_values, index=data_unique.index)
    742 data_with_results = pd.concat([data_unique, df_results], axis=1)
...
File ~/.conda/envs/bnlearn/lib/python3.10/site-packages/joblib/parallel.py:1918, in Parallel.__call__(self, iterable)
   1916     output = self._get_sequential_output(iterable)
   1917     next(output)
-> 1918     return output if self.return_generator else list(output)
   1920 # Let's create an ID that uniquely identifies the current call. If the
   1921 # call is interrupted early and that the same instance is immediately
   1922 # re-used, this id will be used to prevent workers that were
   1923 # concurrently finalizing a task from the previous call to run the
   1924 # callback.
   1925 with self._lock:
File ~/.conda/envs/bnlearn/lib/python3.10/site-packages/pgmpy/factors/discrete/DiscreteFactor.py:569, in DiscreteFactor.reduce(self, values, inplace, show_warnings)
    567 phi.cardinality = phi.cardinality[var_index_to_keep]
    568 phi.del_state_names([var for var, _ in values])
--> 569 phi.values = phi.values[tuple(slice_)]
    571 if not inplace:
    572     return phi

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
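A minimal sketch of what the last frame appears to be hitting, assuming the NaN evidence value ends up being used directly as an array index after the "unknown state name" fallback (that is an inference from the warning and traceback, not something confirmed in the pgmpy source). The `values` array is a made-up stand-in for a factor's value table:

```python
import numpy as np

# Hypothetical stand-in for a factor's value table over two binary variables.
values = np.arange(4).reshape(2, 2)

# A known state resolves to an integer index, so the slice works.
ok = values[tuple([slice(None), 1])]

# NaN is a float; using it as an index raises the same kind of error
# as phi.values[tuple(slice_)] in the traceback.
try:
    values[tuple([slice(None), np.nan])]
    error_type = None
except IndexError as exc:
    error_type = type(exc).__name__
    print(exc)
```

If this is what happens, any NaN left in the evidence dict would be enough to trigger the error, regardless of which column it is in.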

Solution

This is an easy fix. I just changed the following line:

pgmpy/pgmpy/models/BayesianNetwork.py

Line 710 in 5f52e03

evidence=data_point.to_dict(),

to

evidence=data_point[~data_point.isna()].to_dict(),

Two things to note:

It needs to be changed for both the stochastic and non-stochastic cases and, I'm assuming, in predict_probability as well.
We probably want to state in the documentation that predict uses whatever evidence is available in each row, but only predicts the variables in missing_variables, not every missing value. Alternatively, this could be changed to fill in all values that are missing anywhere in the DataFrame.
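For illustration, here is what the proposed filter does to a single row, using only pandas. The `data_point` Series below is a made-up example row, not taken from pgmpy:

```python
import numpy as np
import pandas as pd

# Hypothetical row from predict_data after masking: B and D are missing.
data_point = pd.Series({"A": 1.0, "B": np.nan, "C": 0.0, "D": np.nan})

# Current behaviour: NaN entries are passed along as evidence.
evidence_before = data_point.to_dict()

# Proposed behaviour: only observed entries are used as evidence.
evidence_after = data_point[~data_point.isna()].to_dict()

print(evidence_before)
print(evidence_after)
```

`data_point.dropna().to_dict()` gives the same result and may read a little more directly.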


ankurankan · 2024-09-21T10:31:43Z

@vsocrates Thanks a lot for reporting this. This should indeed have been clearer in the documentation. I also really like your idea of just filling in all the missing values in the given dataframe. I will try to implement that and will try to figure out the best way to deal with that in case of predict_probability.

vsocrates · 2024-09-22T22:17:26Z

For predict_probability, I believe the same change can be made here:

pgmpy/pgmpy/models/BayesianNetwork.py

Line 808 in bb8e328

evidence=data_point.to_dict(),

Apologies, I don't have the time to add tests etc.; otherwise I would make a pull request myself. Thanks for all your work on this library!

ankurankan added the Enhancement label Sep 21, 2024

Nimish-4 linked a pull request Oct 28, 2024 that will close this issue

Modified Bayesian Network's predict method to handle NaNs #1854

Open

3 tasks
