I've built a simple app in Python, with a front-end UI in Dash.

It relies on three files:

  1. a small dataframe in pickle format, 95KB
  2. a large SciPy sparse matrix in NPZ format, 12MB
  3. a large scikit-learn KNN model in joblib format, 65MB

I can read the first dataframe successfully with:

import pandas as pd

link = 'https://github.com/user/project/raw/master/filteredDF.pkl'
df = pd.read_pickle(link)

But when I try the same with the others, for example the model:

mLink = 'https://github.com/user/project/raw/master/knnModel.pkl'
filehandler = open(mLink, 'rb') 
model_knn = pickle.load(filehandler)

I just get an error:

Invalid argument: 'https://github.com/user/project/raw/master/knnModel45Percent.pkl'

I also pushed these files using Git LFS, but the same error occurs.

I understand that hosting large static files on GitHub is bad practice, but I haven't been able to figure out how to use PyDrive or AWS S3 with my project. I just need my project to read these files, and I plan to host the app on something like Heroku, so I don't really need a full database to store them. Ideally I could read the large files straight from my repo, but I'm open to a better approach. I've spent the past few days struggling with the Dropbox, Amazon, and Google Cloud APIs and am a bit lost. Any help is appreciated, thank you.

2 Answers

Could you try the following?

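from io import BytesIO
import pickle
import requests

mLink = 'https://github.com/aaronwangy/Kankoku/blob/master/filteredAnimeList45PercentAll.pkl?raw=true'
response = requests.get(mLink)
response.raise_for_status()  # fail early if the download did not succeed
mfile = BytesIO(response.content)  # wrap the downloaded bytes in a file-like object
model_knn = pickle.load(mfile)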

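The built-in open() only accepts local file paths, not URLs, which is why your call fails with "Invalid argument". With BytesIO you create a file object out of the response that you get from GitHub, and that object can then be used in pickle.load. Note that I have added ?raw=true to the URL of the request.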
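
The same pattern should carry over to the other two files, since only the function that parses the bytes changes. Here is a minimal sketch of that idea. The names fetch_bytes and load_from_url are made up for illustration, and it uses urllib from the standard library instead of requests purely to avoid an extra dependency; treat it as a starting point rather than a tested solution for your repo.

```python
import pickle
from io import BytesIO
from urllib.request import urlopen


def fetch_bytes(url, timeout=30):
    """Download `url` and return the response body as bytes.

    urlopen raises an HTTPError on 4xx/5xx responses, so a bad link
    fails here instead of producing a confusing unpickling error later.
    """
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def load_from_url(url, loader=pickle.load, fetch=fetch_bytes):
    """Fetch `url` and pass the bytes to `loader` as an in-memory file.

    `loader` is any callable that accepts a file-like object;
    `fetch` is injectable so the download step can be swapped out.
    """
    return loader(BytesIO(fetch(url)))
```

For the sparse matrix you would pass loader=scipy.sparse.load_npz, and for the model loader=joblib.load; both accept a file-like object in place of a path. Only unpickle files from a source you trust, since loading a pickle can execute arbitrary code.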

  • Yes, the reading of the pickle dataframe works great, perhaps due to its small size. The main issue lies with the joblib and npz files which are much larger, but still less than 75MB
    – AxW
    May 14, 2020 at 0:00
  • I am having the same issue. I have used your code but now get KeyError: 10. Do you know why?
    – mblume
    Apr 14, 2022 at 8:10

If you are getting KeyError: 10, try (after import joblib)

model_knn = joblib.load(mfile)

instead of

model_knn = pickle.load(mfile)
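
Put together with the download step, that might look like the sketch below. It assumes the model file was written with joblib.dump (the question describes it as being in joblib format), and load_joblib_bytes is just an illustrative name.

```python
from io import BytesIO

import joblib


def load_joblib_bytes(data):
    """Load an object saved with joblib.dump from raw bytes.

    joblib.load accepts a file object as well as a path, so the
    downloaded bytes can be wrapped in BytesIO and read in memory.
    """
    return joblib.load(BytesIO(data))
```

You would call it as model_knn = load_joblib_bytes(requests.get(mLink).content), with mLink being the raw URL of the model file.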
