
Hello World and More

Tammy Yang edited this page Feb 19, 2020 · 1 revision

Basic Example

  1. Load json

    You can load the JSON with your own method, or use the function we wrote for Facebook JSON, which also handles mojibake.

    from fbjson2table.func_lib import parse_fb_json                             
                                                                                
    json_content = parse_fb_json($PATH_OF_JSON)                                 
    
  2. Feed it into "TempDFs", then take a look at "TempDFs.df_list" and "TempDFs.table_name_list":

    from tabulate import tabulate                                               
    from fbjson2table.table_class import TempDFs                                
                                                                                
    temp_dfs = TempDFs(json_content)                                            
    for df, table_name in zip(temp_dfs.df_list, temp_dfs.table_name_list):      
        print(table_name, ':')                                                  
        print(tabulate(df, headers='keys', tablefmt='psql'), '\n')              
    

    Here is an example of json_content

    Here is an example of TempDFs.df_list and TempDFs.table_name_list

    A bit more explanation:

    Every df has its own name. The default name of the root DataFrame is "temp", and each sub-df is named "${NAME_OF_ROOT_DF}__DICT_KEY".

    After the JSON is flattened, each layer has its own id. The total number of layer ids should equal the depth ("peeling") of the original JSON. The id of the first depth is always called "id_0", and each following id is called "id_${DICT_KEY_DEPTH}" (the dict key followed by its depth), such as "id_attachment_1".

    With the ids, we can do a "join" operation. For example, if we want to put the "uri" of "media" and the "timestamp" of posts in the same table, the code will look like:

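    # temp_posts_dfs is assumed to be the df_list of a TempDFs built from the posts json
    top_df = temp_posts_dfs[0].set_index("id_0", drop=False)
    # drop "id_0" from the columns here so the join does not hit an overlapping column name
    append_df = temp_posts_dfs[4].set_index("id_0")

    wanted_df = top_df.join(append_df) # What we want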
    

    If you do not want to track down which df holds the data you want, and you are sure the data has a one-to-one relationship with "top_df", you can use "merge_one_to_one_sub_df".

    For example:

    one_to_one_df = temp_dfs.merge_one_to_one_sub_df(                           
                    temp_dfs.df_list,                                           
                    temp_dfs.table_name_list,                                   
                    temp_dfs.id_column_names_list,                              
                    start_peeling=0) # start_peeling is the index of df we want to set as "top_df" in df_list
    

    Note: in "one_to_one_df", every column name from a sub-df gets the dict key of its depth as a prefix. For example, "id_media_3" => "media_id_media_3".
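To make the naming of tables and ids in step 2 more concrete, here is a minimal pure-Python sketch of peeling nested records into flat tables. It is only an illustration of the idea and is not the implementation inside fbjson2table; the function name `peel`, the sample `posts` data, and the dictionary-based join are all assumptions made for this example.

```python
from collections import defaultdict


def peel(records, table="temp", parent_ids=None, id_name="id_0", depth=0, tables=None):
    """Split nested records into flat tables linked by layer ids (illustrative only)."""
    tables = defaultdict(list) if tables is None else tables
    for i, record in enumerate(records):
        row = dict(parent_ids or {})   # ids inherited from the outer layers
        row[id_name] = i               # id of this layer
        ids = dict(row)                # ids passed down to the child rows
        for key, value in record.items():
            if isinstance(value, dict):
                value = [value]        # treat a nested dict as a one-element list
            if isinstance(value, list) and all(isinstance(v, dict) for v in value):
                # nested records go to their own table, named "<parent>__<key>"
                peel(value, f"{table}__{key}", ids, f"id_{key}_{depth + 1}", depth + 1, tables)
            else:
                row[key] = value       # scalar values stay in the current table
        tables[table].append(row)
    return tables


# Hypothetical sample data, not taken from a real Facebook export
posts = [
    {"timestamp": 1, "attachments": [{"uri": "a.jpg"}, {"uri": "b.jpg"}]},
    {"timestamp": 2, "attachments": [{"uri": "c.jpg"}]},
]
tables = peel(posts)

# "Join" the sub-table back to the root table through the shared "id_0"
timestamps = {row["id_0"]: row["timestamp"] for row in tables["temp"]}
joined = [{**row, "timestamp": timestamps[row["id_0"]]} for row in tables["temp__attachments"]]
```

Every row in "temp__attachments" carries the "id_0" of the post it came from, which is what makes the join possible.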

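Returning to step 1: if you load the JSON with your own method, you may need to repair mojibake yourself. The sketch below assumes one common cause, where UTF-8 bytes were written as individual "\u00XX" escapes and therefore decode as Latin-1 characters. It is not the code of parse_fb_json; the helper name `fix_mojibake` and the sample string are hypothetical.

```python
import json


def fix_mojibake(value):
    """Recursively repair strings whose UTF-8 bytes were read as Latin-1."""
    if isinstance(value, str):
        try:
            return value.encode("latin-1").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            return value  # already correct, or not repairable this way
    if isinstance(value, list):
        return [fix_mojibake(v) for v in value]
    if isinstance(value, dict):
        return {fix_mojibake(k): fix_mojibake(v) for k, v in value.items()}
    return value  # numbers, booleans and None pass through unchanged


# Hypothetical sample: "é" stored as two escaped bytes
raw = '{"name": "Caf\\u00c3\\u00a9"}'
fixed = fix_mojibake(json.loads(raw))
```

Strings that are already correct are returned unchanged, so the helper can be applied to a whole document.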
Get Columns You Want

  1. Create a one-to-one DataFrame

from fbjson2table.func_lib import parse_fb_json                              
from fbjson2table.table_class import TempDFs                                 
                                                                             
                                                                             
json_content = parse_fb_json($PATH_OF_JSON)                                  
temp_dfs = TempDFs(json_content)                                             
one_to_one_df, _ = temp_dfs.temp_to_wanted_df(                               
              wanted_columns=[]                                              
              )                                                              

Take a look at one_to_one_df and decide which columns we want.

print(one_to_one_df.columns)

  2. From the full table, get only the wanted columns

# You will need to pre-define the LIST_OF_WANTED_COLUMNS                                                                          
wanted_columns = LIST_OF_WANTED_COLUMNS                                      
                                                                             
df, top_id = temp_dfs.temp_to_wanted_df(                                     
         wanted_columns=wanted_columns                                       
     )                                                                       

The "df" is what we can use for analysis: simple, easy, and with only the columns we need.
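When the one-to-one table has many columns, building LIST_OF_WANTED_COLUMNS by hand can be tedious. As a small sketch, the list can be filtered by keyword from the printed column names. The helper `pick_columns` and the column names below are hypothetical and not part of fbjson2table; in practice the input would be `list(one_to_one_df.columns)`.

```python
def pick_columns(columns, keywords):
    """Return the column names that contain any of the keywords, in original order."""
    return [c for c in columns if any(k in c for k in keywords)]


# Hypothetical column names, standing in for list(one_to_one_df.columns)
columns = ["id_0", "timestamp", "attachments_id_attachments_1", "media_uri", "media_title"]
wanted_columns = pick_columns(columns, ["timestamp", "uri"])
```

The resulting list can then be passed as `wanted_columns` to `temp_to_wanted_df`.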
