scifact.model package

Submodules

scifact.model.download_pdf module

scifact.model.download_pdf.extract_ref_pdf(text)

Extract portion of the pdf that appears in the References section

Parameters

text (str) – contents of the pdf in str form

Returns

text of all the References found in the pdf

Return type

str
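
As a rough illustration of this kind of extraction, here is a minimal sketch. It is not the package's implementation: the helper name extract_references_section is hypothetical, and it assumes the section starts at a line containing only "References" or "Bibliography".

```python
import re


def extract_references_section(text: str) -> str:
    """Return the text after the last 'References' heading, or '' if none is found."""
    # Assumption: the heading sits on a line of its own (case-insensitive).
    heading = re.compile(
        r"^[ \t]*(references|bibliography)[ \t]*$",
        flags=re.IGNORECASE | re.MULTILINE,
    )
    matches = list(heading.finditer(text))
    if not matches:
        return ""
    # Use the last heading so an earlier in-text mention is not mistaken for the section.
    return text[matches[-1].end():].strip()


sample = "Intro text [1].\nReferences\n[1] A. Author. Some title. 2020.\n"
references = extract_references_section(sample)
print(references)
```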

scifact.model.download_pdf.find_download_pdf(pdf_name, data)

Given a name of a pdf, downloads the pdf

Parameters
  • pdf_name (str) – name of the pdf to download which contains the claim

  • data (pandas dataframe) – arxiv dataset which contains the details of all pdfs and their authors, links etc

Returns

all the content/text found in the pdf

Return type

str

scifact.model.download_pdf.unzip(path_to_zip_file, dir_path)

Unzip a folder

Parameters
  • path_to_zip_file (str) – path to the zipped folder

  • dir_path (str) – path to place unzipped files
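
A function with this signature could be written with the standard library's zipfile module. The sketch below is an assumption about the behaviour, not the package's code; the name unzip_archive and the returned list of member names are illustrative only.

```python
import tempfile
import zipfile
from pathlib import Path
from typing import List


def unzip_archive(path_to_zip_file: str, dir_path: str) -> List[str]:
    """Extract every member of the archive into dir_path and return their names."""
    # Create the target directory if it does not exist yet.
    Path(dir_path).mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(path_to_zip_file) as archive:
        archive.extractall(dir_path)
        return archive.namelist()


# Usage: build a small archive in a temporary directory, then extract it.
with tempfile.TemporaryDirectory() as tmp:
    zip_path = Path(tmp) / "demo.zip"
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("paper.txt", "hello")
    extracted = unzip_archive(str(zip_path), str(Path(tmp) / "out"))
    content = (Path(tmp) / "out" / "paper.txt").read_text()
```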

scifact.model.label module

scifact.model.label.Label_sentences(df)

Use the label_model to label the selected abstracts as Supports/Rejects

Parameters

df (pandas df) – df containing claim and sentences selected by the rationale model

Returns

labels predicted for each sentence selected by the rationale model

Return type

list

scifact.model.label.encode(sentences, claims, tokenizer)

Encode sentences and claim using the labeling model tokenizer

Parameters
  • sentences (str) – sentences selected by the pretrained rationale selection model that are most relevant to the claim

  • claims (str) – claim/query entered by the user

  • tokenizer – tokenizer of the labeling model

Returns

dict with the tokenized claim and the sentences that are most relevant to the claim

Return type

dict

scifact.model.pretrained_model module

class scifact.model.pretrained_model.rationale_label_selection

Bases: object

Cosine_Evidence_Selection(top_matches, df)

Given the number of top matches and a df containing the claim and sentences, find the most relevant sentences

Parameters
  • top_matches (int) – number of top sentences to find

  • df (pandas dataframe) – dataframe containing the claim/query and all cited document sentences

Returns

list of predicted sentences
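
To illustrate cosine-based selection in general, here is a minimal sketch. It makes several assumptions that the documentation above does not confirm: it takes a plain list of sentences instead of a dataframe, it represents text as bag-of-words counts (the package may well use model embeddings instead), and the helper names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_top_sentences(claim: str, sentences: List[str], top_matches: int) -> List[str]:
    """Return the top_matches sentences most similar to the claim."""
    claim_vec = bag_of_words(claim)
    scored = [(cosine_similarity(claim_vec, bag_of_words(s)), s) for s in sentences]
    # Highest similarity first; ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sentence for _, sentence in scored[:top_matches]]


sentences = [
    "The virus spreads through airborne droplets.",
    "Stock prices rose sharply last year.",
    "Airborne transmission of the virus is fast.",
]
top = select_top_sentences("The virus spreads through air", sentences, 2)
print(top)
```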

abstract_selection(doc_query, references2, top_matches, data_copy)
Given a claim/query, the text of all its citations, the number of matches required and the arxiv dataset, prints the abstracts

Parameters
  • doc_query (str) – user entered claim/query

  • references2 (str) – text of all the citations combined together

  • top_matches (int) – number of matching abstracts to extract

  • data_copy (pandas dataframe) – arxiv dataset which contains the details of all pdfs and their authors, links etc

Returns

call to function find_extracts_labels

download_all_ref_content(all_ref, references2, data_original)
Given a list of citations, download cited documents from the internet and combine them.

Example: If the provided citation list is [3,6,9], the function will search the References section of the primary pdf, locate the titles of the pdfs corresponding to the 3rd, 6th and 9th citations, download them, preprocess them and combine them into a single str

Parameters
  • all_ref (list) – list of all the citation numbers

  • references2 (str) – References section of the primary pdf

  • data_original (pandas dataframe) – arxiv dataset which contains the details of all pdfs and their authors, links etc

Returns
  • ref_str (str) – a str of all the sentences from the different cited documents combined

  • ref_list (list) – a list of all the sentences from the different cited documents combined

find_extracts_labels(doc_query, all_ref_text, top_matches_entered)

Given a claim/query, text from cited documents and top matches, print the relevant sentences

Parameters
  • doc_query (str) – user entered claim/query

  • ref_text_list (list) – list of all the sentences from the different cited documents combined

  • top_matches_entered (int) – number of relevant sentences to return

preprocess_query(doc_query)

Given a claim/query, finds the citations within the sentence. Example: if the claim/query is “Covid spread through air[3,6,9] and transmits fast”, the function finds the citation numbers [3,6,9] and returns them as a list

Parameters

doc_query (str) – user entered claim/query

Returns

all_ref which is a list of all the citation numbers

Return type

list
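
The citation lookup described for preprocess_query can be sketched with a regular expression. This is an assumption about the approach, not the method's actual code; the function name find_citation_numbers is hypothetical, and it only handles bracketed numeric citations such as [3,6,9].

```python
import re
from typing import List


def find_citation_numbers(doc_query: str) -> List[int]:
    """Collect the numbers inside bracketed citation groups, in order of appearance."""
    numbers: List[int] = []
    # Each group is the content of one pair of brackets, e.g. "3,6,9".
    for group in re.findall(r"\[([\d,\s]+)\]", doc_query):
        numbers.extend(int(n) for n in re.findall(r"\d+", group))
    return numbers


all_ref = find_citation_numbers("Covid spread through air[3,6,9] and transmits fast")
print(all_ref)  # [3, 6, 9]
```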

printwd()

Module contents