Multiple missing values in BayesianNetwork.predict throws IndexError #1817

Open
vsocrates opened this issue Sep 20, 2024 · 2 comments · May be fixed by #1854

Comments

vsocrates commented Sep 20, 2024

Subject of the issue

When the DataFrame passed into the predict function of BayesianNetwork contains multiple missing values, predict throws an error.

Your environment

  • pgmpy version: dev
  • Python version: 3.10.14
  • Operating System: Red Hat Enterprise Linux 8.8 (Ootpa)

Steps to reproduce

Using the documentation example:

import numpy as np
import pandas as pd
from pgmpy.models import BayesianNetwork

values = pd.DataFrame(np.random.randint(low=0, high=2, size=(1000, 5)),
                      columns=['A', 'B', 'C', 'D', 'E'])

train_data = values[:800]
predict_data = values[800:]
model = BayesianNetwork([('A', 'B'), ('C', 'B'), ('C', 'D'), ('B', 'E')])
model.fit(train_data)
predict_data = predict_data.copy()
predict_data.drop('E', axis=1, inplace=True)

# randomly mask some values as missing (NaN), leaving at least one observed value per row
mask = np.random.choice([True, False], size=predict_data.shape)
mask[mask.all(1),-1] = 0
predict_data = predict_data.mask(mask)

# predict throws error
y_pred = model.predict(predict_data)

Expected behaviour

predict should not throw an error when the input DataFrame contains missing values.

Actual behaviour

predict throws an IndexError once execution reaches DiscreteFactor.reduce:

WARNING:pgmpy:Found unknown state name. Trying to switch to using all state names as state numbers

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[1], line 21
     18 predict_data = predict_data.mask(mask)
     20 # predict throws error
---> 21 y_pred = model.predict(predict_data)

File ~/.conda/envs/bnlearn/lib/python3.10/site-packages/pgmpy/models/BayesianNetwork.py:730, in BayesianNetwork.predict(self, data, stochastic, n_jobs)
    727 pred_values = []
    729 # Send state_names dict from one of the estimated CPDs to the inference class.
--> 730 pred_values = Parallel(n_jobs=n_jobs)(
    731     delayed(model_inference.map_query)(
    732         variables=missing_variables,
    733         evidence=data_point.to_dict(),
    734         show_progress=False,
    735     )
    736     for index, data_point in tqdm(
    737         data_unique.iterrows(), total=data_unique.shape[0]
    738     )
    739 )
    741 df_results = pd.DataFrame(pred_values, index=data_unique.index)
    742 data_with_results = pd.concat([data_unique, df_results], axis=1)
...
File ~/.conda/envs/bnlearn/lib/python3.10/site-packages/joblib/parallel.py:1918, in Parallel.__call__(self, iterable)
   1916     output = self._get_sequential_output(iterable)
   1917     next(output)
-> 1918     return output if self.return_generator else list(output)
   1920 # Let's create an ID that uniquely identifies the current call. If the
   1921 # call is interrupted early and that the same instance is immediately
   1922 # re-used, this id will be used to prevent workers that were
   1923 # concurrently finalizing a task from the previous call to run the
   1924 # callback.
   1925 with self._lock:
File ~/.conda/envs/bnlearn/lib/python3.10/site-packages/pgmpy/factors/discrete/DiscreteFactor.py:569, in DiscreteFactor.reduce(self, values, inplace, show_warnings)
    567 phi.cardinality = phi.cardinality[var_index_to_keep]
    568 phi.del_state_names([var for var, _ in values])
--> 569 phi.values = phi.values[tuple(slice_)]
    571 if not inplace:
    572     return phi

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
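
The message above is the one NumPy raises when an array is indexed with a float. A plausible reading, not confirmed against the pgmpy source, is that a NaN evidence value is not a known state name, so after the warning it is passed through as a state number and ends up inside slice_. The NumPy-only sketch below reproduces the same message; the array and slice_ contents are made up for illustration.

```python
import numpy as np

# Stand-in for a factor's value table: two binary variables.
values = np.arange(4).reshape(2, 2)

# Assumption for illustration: a NaN evidence value lands in the index tuple
# in place of an integer state number.
slice_ = [float("nan"), slice(None)]

try:
    values[tuple(slice_)]
    error_message = None
except IndexError as exc:
    error_message = str(exc)

print(error_message)
```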

Solution

This is an easy fix. I just changed the following line:

evidence=data_point.to_dict(),

to

evidence=data_point[~data_point.isna()].to_dict(),
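
To show what this change does to the evidence dict, here is a small pandas-only sketch. The row contents are made up; data_point stands for one row as yielded by iterrows().

```python
import numpy as np
import pandas as pd

# One row of predict_data after masking: B is missing.
data_point = pd.Series({"A": 0.0, "B": np.nan, "C": 1.0})

# Original behaviour: NaN is passed along as if it were an observed state.
raw_evidence = data_point.to_dict()

# Proposed fix: keep only the observed entries.
clean_evidence = data_point[~data_point.isna()].to_dict()

# dropna() is an equivalent, slightly shorter spelling.
same_as_dropna = data_point.dropna().to_dict()

print(raw_evidence)
print(clean_evidence)
```

With the filter in place, only A and C reach the inference call as evidence for this row.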

Two things to note:

  1. The change needs to be made in both the stochastic and non-stochastic cases, and I'm assuming in predict_probability as well.
  2. The documentation should probably state that predict uses whatever evidence is available in each row, but only predicts the variables in missing_variables, not every missing value. Alternatively, predict could be changed to fill in every value that is missing anywhere in the DataFrame.
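
The second alternative in point 2 could look roughly like the sketch below. It is not pgmpy code: fill_missing and its infer argument are hypothetical names, and infer stands in for any callable that takes variables and evidence and returns a dict of predicted states (the role map_query plays in the traceback). The sketch works row by row and ignores the deduplication and parallelism that predict uses.

```python
import pandas as pd


def fill_missing(data, all_variables, infer):
    """Return a copy of `data` with every NaN cell and absent column filled.

    `infer(variables, evidence)` is a hypothetical callable returning a dict
    that maps each requested variable to a predicted state.
    """
    # Add columns for variables that are absent from the input entirely.
    source = data.reindex(columns=list(all_variables))
    filled = source.astype(object)  # separate copy that can hold any state type

    for idx, row in source.iterrows():
        missing = [var for var in all_variables if pd.isna(row[var])]
        if not missing:
            continue
        # Use only the observed values of this row as evidence.
        evidence = row.dropna().to_dict()
        prediction = infer(variables=missing, evidence=evidence)
        for var in missing:
            filled.at[idx, var] = prediction[var]
    return filled


# Usage with a stub in place of real inference: always predicts state 0.
def always_zero(variables, evidence):
    return {var: 0 for var in variables}


demo = pd.DataFrame({"A": [0.0, float("nan")], "B": [1.0, 1.0]})
print(fill_missing(demo, ["A", "B", "E"], always_zero))
```

Because the set of variables to predict is computed per row, each row can have a different pattern of missing values.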
@ankurankan
Member

@vsocrates Thanks a lot for reporting this. This should indeed have been clearer in the documentation. I also really like your idea of just filling in all the missing values in the given dataframe. I will try to implement that and will try to figure out the best way to deal with that in case of predict_probability.

@vsocrates
Author

For predict_probability, I believe the same change can be made here:

evidence=data_point.to_dict(),

Apologies, I don't have the time to add tests etc.; otherwise I would make a pull request myself. Thanks for all your work on this library!

@Nimish-4 linked a pull request on Oct 28, 2024 that will close this issue