scifact.tasks package

Submodules

scifact.tasks.arxiv_preprocess module

scifact.tasks.arxiv_preprocess.arxiv_clean(df)

Clean the arXiv dataset so that it can be used to search for PDF links.

Parameters

df (pandas DataFrame) – arXiv dataset containing the details of every PDF, such as its authors and links

Returns

The cleaned arXiv dataset

Return type

pandas DataFrame

scifact.tasks.arxiv_preprocess.rem_bracket(line)
scifact.tasks.arxiv_preprocess.rem_unwanted(line)
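
The source gives no description for rem_bracket and rem_unwanted. Judging only by their names, they presumably strip bracket characters and other unwanted characters from a line of text. The sketch below shows one way such helpers might look; the function names, the regular expressions, and the sample string are all assumptions made for illustration, not the package's implementation.

```python
import re


def strip_brackets_sketch(line: str) -> str:
    """Hypothetical stand-in: drop '[' and ']' but keep the text between them."""
    return re.sub(r"[\[\]]", "", line)


def strip_unwanted_sketch(line: str) -> str:
    """Hypothetical stand-in: remove quote characters and collapse whitespace."""
    without_quotes = re.sub(r"['\"]", "", line)
    return re.sub(r"\s+", " ", without_quotes).strip()


# Assumed sample value; the real dataset columns are not shown in the source.
sample = "[{'name': 'A.  Author'}]"
cleaned = strip_unwanted_sketch(strip_brackets_sketch(sample))
```

Helpers of this shape are typically applied to a single column of the DataFrame before it is searched.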

scifact.tasks.download module

class scifact.tasks.download.ContentData(*args, **kwargs)

Bases: luigi.task.ExternalTask

DATA_ROOT = 's3://advancedpythonmeenu/scifact/'
client = <luigi.contrib.s3.S3Client object>
data_name = <luigi.parameter.Parameter object>
output()

The output that this Task produces.

The output of the Task determines whether the Task needs to be run: the task is considered finished if and only if all of its outputs exist. Subclasses should override this method to return a single Target or a list of Target instances.

Implementation note

If running multiple workers, the output must be a resource that is accessible by all workers, such as a DFS or database. Otherwise, workers might compute the same output since they don’t see the work done by other workers.

See Task.output
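
The completion rule described for output() can be illustrated without Luigi. The sketch below uses local files as stand-ins for Targets; is_complete is a hypothetical helper, and treating an empty output list as "not complete" is a choice made for this example.

```python
import tempfile
from pathlib import Path


def is_complete(outputs):
    """Return True only when there is at least one output and all of them exist."""
    outputs = list(outputs)
    return bool(outputs) and all(Path(p).exists() for p in outputs)


with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "part-1.csv"
    second = Path(tmp) / "part-2.csv"
    first.write_text("a,b\n")
    partial = is_complete([first, second])  # second is still missing
    second.write_text("c,d\n")
    full = is_complete([first, second])     # both outputs exist now
```

A task with one missing output counts as unfinished, so a scheduler following this rule would run it again.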

class scifact.tasks.download.DownloadData(*args, **kwargs)

Bases: luigi.task.Task

LOCAL_ROOT = '/Users/meenu/Desktop/Harvard/AdvancedPython/Assignments/Pset3/2021sp-scifact-lalitanjali-ai/docsrc/data'
S3_ROOT = 's3://advancedpythonmeenu/scifact/'
SHARED_RELATIVE_PATH = 'dataset/'
data = <luigi.parameter.Parameter object>
output()

The output that this Task produces.

The output of the Task determines whether the Task needs to be run: the task is considered finished if and only if all of its outputs exist. Subclasses should override this method to return a single Target or a list of Target instances.

Implementation note

If running multiple workers, the output must be a resource that is accessible by all workers, such as a DFS or database. Otherwise, workers might compute the same output since they don’t see the work done by other workers.

See Task.output

path = '/Users/meenu/Desktop/Harvard/AdvancedPython/Assignments/Pset3/2021sp-scifact-lalitanjali-ai/docsrc/data/dataset/'
requires()

The Tasks that this Task depends on.

A Task will only run if all of the Tasks that it requires are completed. If your Task does not require any other Tasks, then you don’t need to override this method. Otherwise, a subclass can override this method to return a single Task, a list of Task instances, or a dict whose values are Task instances.

See Task.requires
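
requires() may return a single Task, a list of Tasks, or a dict whose values are Tasks. A minimal sketch of how those three shapes can be normalized and checked follows. Plain strings stand in for Task instances, and both function names are hypothetical; this is not Luigi's own code.

```python
def flatten_requirements(required):
    """Normalize the value returned by requires() into a flat list."""
    if required is None:
        return []                       # the task has no dependencies
    if isinstance(required, dict):
        return list(required.values())  # only the dict values are tasks
    if isinstance(required, (list, tuple)):
        return list(required)
    return [required]                   # a single task


def can_run(required, completed):
    """A task may run only once every required task is completed."""
    return all(dep in completed for dep in flatten_requirements(required))
```

For example, can_run({"data": "DownloadData"}, {"DownloadData"}) is true, while can_run(["A", "B"], {"A"}) is false because "B" has not completed.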

run()

The task run method, to be overridden in a subclass.

See Task.run
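
The attribute values listed for this class are consistent with path being LOCAL_ROOT joined with SHARED_RELATIVE_PATH. The sketch below shows that composition, plus an assumed matching S3 location. The helper names, the example roots, and the file name are hypothetical; the source does not show how the class actually builds its paths.

```python
import posixpath


def local_path(local_root, shared_relative_path):
    """Join the local root and the shared relative directory."""
    return posixpath.join(local_root, shared_relative_path)


def s3_path(s3_root, shared_relative_path, name):
    """Assumed layout: the same relative directory under the S3 root."""
    return posixpath.join(s3_root, shared_relative_path, name)


# Example values only; they are not the paths used by the package.
local_dir = local_path("/tmp/data", "dataset/")
remote_file = s3_path("s3://example-bucket/scifact/", "dataset/", "claims.csv")
```

Keeping the relative directory in one attribute means the local copy mirrors the layout under the S3 root.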

class scifact.tasks.download.DownloadModel(*args, **kwargs)

Bases: luigi.task.Task

LOCAL_ROOT = '/Users/meenu/Desktop/Harvard/AdvancedPython/Assignments/Pset3/2021sp-scifact-lalitanjali-ai/docsrc/data'
S3_ROOT = 's3://advancedpythonmeenu/scifact/'
SHARED_RELATIVE_PATH = 'saved_models/'
model = <luigi.parameter.Parameter object>
output()

The output that this Task produces.

The output of the Task determines whether the Task needs to be run: the task is considered finished if and only if all of its outputs exist. Subclasses should override this method to return a single Target or a list of Target instances.

Implementation note

If running multiple workers, the output must be a resource that is accessible by all workers, such as a DFS or database. Otherwise, workers might compute the same output since they don’t see the work done by other workers.

See Task.output

path = '/Users/meenu/Desktop/Harvard/AdvancedPython/Assignments/Pset3/2021sp-scifact-lalitanjali-ai/docsrc/data/saved_models/'
requires()

The Tasks that this Task depends on.

A Task will only run if all of the Tasks that it requires are completed. If your Task does not require any other Tasks, then you don’t need to override this method. Otherwise, a subclass can override this method to return a single Task, a list of Task instances, or a dict whose values are Task instances.

See Task.requires

run()

The task run method, to be overridden in a subclass.

See Task.run

class scifact.tasks.download.SavedModel(*args, **kwargs)

Bases: luigi.task.ExternalTask

MODEL_ROOT = 's3://advancedpythonmeenu/scifact/'
client = <luigi.contrib.s3.S3Client object>
model = <luigi.parameter.Parameter object>
output()

The output that this Task produces.

The output of the Task determines whether the Task needs to be run: the task is considered finished if and only if all of its outputs exist. Subclasses should override this method to return a single Target or a list of Target instances.

Implementation note

If running multiple workers, the output must be a resource that is accessible by all workers, such as a DFS or database. Otherwise, workers might compute the same output since they don’t see the work done by other workers.

See Task.output

scifact.tasks.tasks module

class scifact.tasks.tasks.Preprocess_arxiv_data(*args, **kwargs)

Bases: luigi.task.Task

LOCAL_ROOT = '/Users/meenu/Desktop/Harvard/AdvancedPython/Assignments/Pset3/2021sp-scifact-lalitanjali-ai/docsrc/data'
SHARED_RELATIVE_PATH = 'dataset/'
ext = '.csv'
glob1 = 'preprocessed_arXivData.csv'
glob_path = <luigi.parameter.Parameter object>
output()

The output that this Task produces.

The output of the Task determines whether the Task needs to be run: the task is considered finished if and only if all of its outputs exist. Subclasses should override this method to return a single Target or a list of Target instances.

Implementation note

If running multiple workers, the output must be a resource that is accessible by all workers, such as a DFS or database. Otherwise, workers might compute the same output since they don’t see the work done by other workers.

See Task.output

path = '/Users/meenu/Desktop/Harvard/AdvancedPython/Assignments/Pset3/2021sp-scifact-lalitanjali-ai/docsrc/data/dataset/'
requirement = None
requires() → dict

Returns the requirements of a task.

Assumes the task class has Requirement descriptors, which can clone the appropriate dependencies from the task instance.

Returns

requirements compatible with task.requires()

Return type

dict
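
The descriptor-based requires() described above can be sketched in plain Python. Every name here (Requirement, collect_requirements, Upstream, Demo) is hypothetical and merely mirrors the description. The real descriptors are not shown in the source and presumably clone Luigi tasks, whereas this sketch calls a factory function.

```python
class Requirement:
    """Descriptor that builds a dependency from the owning task instance."""

    def __init__(self, factory):
        self.factory = factory  # callable that receives the task instance

    def __get__(self, task, owner):
        if task is None:
            return self         # accessed on the class, not on an instance
        return self.factory(task)


def collect_requirements(task):
    """Return {attribute name: dependency} for each Requirement on the class."""
    found = {}
    for klass in reversed(type(task).__mro__):
        for name, value in vars(klass).items():
            if isinstance(value, Requirement):
                found[name] = getattr(task, name)  # triggers __get__
    return found


class Upstream:
    def __init__(self, data):
        self.data = data


class Demo:
    data = "arxiv.csv"
    upstream = Requirement(lambda task: Upstream(task.data))

    def requires(self):
        return collect_requirements(self)
```

Calling Demo().requires() returns a dict with the single key "upstream", whose value carries the same data parameter as the Demo instance.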

run()

The task run method, to be overridden in a subclass.

See Task.run

class scifact.tasks.tasks.find_display_abstracts(*args, **kwargs)

Bases: luigi.task.Task

doc_query = <luigi.parameter.Parameter object>
model_label = 'label_roberta_large_fever_scifact.zip'
model_rationale = 'rationale_roberta_large_fever.zip'
pdf_name = <luigi.parameter.Parameter object>
requires()

The Tasks that this Task depends on.

A Task will only run if all of the Tasks that it requires are completed. If your Task does not require any other Tasks, then you don’t need to override this method. Otherwise, a subclass can override this method to return a single Task, a list of Task instances, or a dict whose values are Task instances.

See Task.requires

run()

The task run method, to be overridden in a subclass.

See Task.run

top_matches = <luigi.parameter.IntParameter object>

Module contents